Abstract


We determine the asymptotic behaviour of the function computed by support vector machines 
(SVM) and related algorithms that minimize a regularized empirical convex loss function in the 
reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number 
of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the
regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the
$L_2$ sense are provided, together with upper bounds on the classification error that is shown to
converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although
the regularization term does not vanish. These results are particularly relevant to the one-class
SVM, for which the regularization cannot vanish by construction, and which is shown for the first
time to be a consistent density level set estimator.


Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel 
density estimation 


1. Introduction 
Given $n$ independent and identically distributed (i.i.d.) copies $(X_1,Y_1),\ldots,(X_n,Y_n)$ of a
random variable $(X,Y) \in \mathbb{R}^d \times \{-1,1\}$, we study in this paper the limit and
consistency of learning algorithms that solve the following problem:

$$\arg\min_{f \in \mathcal{H}_\sigma} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i f(X_i)\right) + \lambda \| f \|_{\mathcal{H}_\sigma}^2 \right\}, \tag{1}$$

where $\phi : \mathbb{R} \to \mathbb{R}$ is a convex loss function and $\mathcal{H}_\sigma$ is the
reproducing kernel Hilbert space (RKHS) of the normalized Gaussian radial basis function kernel
(denoted simply Gaussian kernel below):

$$k_\sigma(x,x') := \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^d} \exp\left( \frac{-\| x - x' \|^2}{2\sigma^2} \right), \quad \sigma > 0. \tag{2}$$


This framework encompasses in particular the classical support vector machine (SVM) (Boser et al.,
1992) when $\phi(u) = \max(1-u, 0)$ (Theorem 6). Recent years have witnessed important theoretical
advances aimed at understanding the behavior of such regularized algorithms when $n$ tends to
infinity and $\lambda$ decreases to 0. In particular the consistency and convergence rates of the
two-class SVM (see, e.g., Steinwart, 2002; Zhang, 2004; Steinwart and Scovel, 2004, and references
therein) have been studied in detail, as well as the shape of the asymptotic decision function
(Steinwart, 2003; Bartlett and Tewari, 2004). The case of more general convex loss functions has
also attracted a lot of attention recently (Zhang, 2004; Lugosi and Vayatis, 2004; Bartlett et al.,
2006), and has been shown to provide, under general assumptions, consistent procedures for the
classification error.

All results published so far, however, study the case where $\lambda$ decreases as the number of
points tends to infinity (or, equivalently, where $\lambda\sigma^{-d}$ converges to 0 if one uses
the classical non-normalized version of the Gaussian kernel instead of (2)). Although it seems
natural to reduce regularization as more and more training data are available (even more than
natural, it is the spirit of regularization; Tikhonov and Arsenin, 1977; Silverman, 1982), there is
at least one important situation where $\lambda$ is typically held fixed: the one-class SVM
(Schölkopf et al., 2001). In that case, the goal is to estimate an $\alpha$-quantile, that is, a
subset of $\mathbb{R}^d$ of given probability $\alpha$ with minimum volume. The estimation is
performed by thresholding the function output by the one-class SVM, that is, the SVM (1) with only
positive examples; in that case $\lambda$ is supposed to determine the quantile level.¹ Although it
is known that the fraction of examples in the selected region converges to the desired quantile
level $\alpha$ (Schölkopf et al., 2001), it is still an open question whether the region converges
towards a quantile, that is, a region of minimum volume. Besides, most theoretical results about
the consistency and convergence rates of two-class SVM with vanishing regularization constant do
not translate to the one-class case, as we are precisely in the rare situation where the SVM is
used with a regularization term that does not vanish as the sample size increases.

The main contribution of this paper is to show that Bayes consistency for the classification error
can be obtained for algorithms that solve (1) without decreasing $\lambda$, if instead the
bandwidth $\sigma$ of the Gaussian kernel decreases at a suitable rate. We prove upper bounds on
the convergence rate of the classification error towards the Bayes risk for a variety of functions
$\phi$ and of distributions $P$, in particular for SVM (Theorem 6). Moreover, we provide an
explicit description of the function asymptotically output by the algorithms, and establish
convergence rates towards this limit for the $L_2$ norm (Lemma 7). In particular, we show that the
decision function output by the one-class SVM converges towards the density to be estimated,
truncated at the level $2\lambda$ (Theorem 8); we finally show (Theorem 9) that this implies the
consistency of one-class SVM as a density level set estimator for the excess-mass functional
(Hartigan, 1987).

This paper is organized as follows. In Section 2, we set the framework of this study and state
the main results. The rest of the paper is devoted to the proofs and some extensions of these
results. In Section 3, we provide a number of known and new properties of the Gaussian RKHS.
Section 4 is devoted to the proof of the main theorem that describes the speed of convergence of
the regularized $\phi$-risk of its empirical minimizer towards its minimum. This proof involves in
particular a control of the sample error in this particular setting that is dealt with in
Section 5. Section 6 relates the minimization of the regularized $\phi$-risk to more classical
measures of performance, in particular classification error and $L_2$ distance to the limit. These
results are discussed in more detail in Section 7 for the case of the 1- and 2-SVM. Finally the
proof of the consistency of the one-class SVM as a density level set estimator is postponed to
Section 8.

1. While the original formulation of the one-class SVM involves a parameter $\nu$, there is
asymptotically a one-to-one correspondence between $\lambda$ and $\nu$.


2. Notation and Main Results 


Let $(X,Y)$ be a pair of random variables taking values in $\mathbb{R}^d \times \{-1,1\}$, with
distribution $P$. We assume throughout this paper that the marginal distribution of $X$ has a
density $\rho : \mathbb{R}^d \to \mathbb{R}$ with respect to the Lebesgue measure, and that its
support is included in a compact set $\mathcal{X} \subset \mathbb{R}^d$. Let
$\eta : \mathbb{R}^d \to [0,1]$ denote a measurable version of the conditional distribution of
$Y = 1$ given $X$. The function $2\eta - 1$ then corresponds to the so-called regression function.

The normalized Gaussian radial basis function (RBF) kernel $k_\sigma$ with bandwidth parameter
$\sigma > 0$ is defined for any $(x,x') \in \mathbb{R}^d \times \mathbb{R}^d$ by:²

$$k_\sigma(x,x') := \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^d} \exp\left( \frac{-\| x - x' \|^2}{2\sigma^2} \right),$$

the corresponding reproducing kernel Hilbert space (RKHS) is denoted by $\mathcal{H}_\sigma$, with
associated norm $\| . \|_{\mathcal{H}_\sigma}$. Moreover let

$$\kappa_\sigma := \| k_\sigma \|_{L_\infty} = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^d}. \tag{3}$$


Several useful properties of this kernel and its RKHS are gathered in Section 3. 
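As a small illustration of the kernel definition and of the constant in (3), the following is a minimal Python sketch. The function names `kappa` and `gaussian_kernel` are ours, chosen for illustration only; points are assumed to be given as equal-length sequences of floats.

```python
import math


def kappa(sigma: float, d: int) -> float:
    """kappa_sigma = 1 / (sqrt(2*pi) * sigma)**d, the sup-norm of k_sigma in (3)."""
    return 1.0 / (math.sqrt(2.0 * math.pi) * sigma) ** d


def gaussian_kernel(x, y, sigma: float) -> float:
    """Normalized Gaussian RBF kernel k_sigma(x, y) for points in R^d."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return kappa(sigma, len(x)) * math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    x, y = (0.0, 0.0), (0.3, -0.4)
    for sigma in (1.0, 0.5, 0.1):
        # The kernel attains its maximum kappa_sigma on the diagonal x == y.
        print(sigma, kappa(sigma, 2), gaussian_kernel(x, y, sigma))
```

The sketch makes visible that $\kappa_\sigma$ grows like $\sigma^{-d}$ as the bandwidth shrinks, which is why this constant appears in the bounds stated later in this section.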
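Denoting by $\mathcal{M}$ the set of measurable real-valued functions on $\mathbb{R}^d$, we define
several risks for functions $f \in \mathcal{M}$:

• The classification error rate, usually referred to as the risk of $f$, is defined by

$$R(f) := P\left( \mathrm{sign}(f(X)) \neq Y \right),$$

and the minimum achievable classification error rate over $\mathcal{M}$ is called the Bayes risk:

$$R^* := \inf_{f \in \mathcal{M}} R(f).$$

• For a scalar $\lambda > 0$ fixed throughout this paper and a convex function
$\phi : \mathbb{R} \to \mathbb{R}$, the $\phi$-risk regularized by the RKHS norm is defined, for
any $\sigma > 0$ and $f \in \mathcal{H}_\sigma$, by

$$R_{\phi,\sigma}(f) := E_P\left[ \phi(Y f(X)) \right] + \lambda \| f \|_{\mathcal{H}_\sigma}^2,$$

and the minimum achievable $R_{\phi,\sigma}$-risk over $\mathcal{H}_\sigma$ is denoted by

$$R^*_{\phi,\sigma} := \inf_{f \in \mathcal{H}_\sigma} R_{\phi,\sigma}(f).$$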

















2. We refer the reader to Section 3.2 for a brief discussion on the relation between normalized/unnormalized Gaussian 
kernel. 
Furthermore, for any real $r \geq 0$, we know that $\phi$ is Lipschitz on $[-r,r]$, and we denote
by $L(r)$ the Lipschitz constant of the restriction of $\phi$ to the interval $[-r,r]$. For
example, for the hinge loss $\phi(u) = \max(0, 1-u)$ one can take $L(r) = 1$, and for the squared
hinge loss $\phi(u) = \max(0, 1-u)^2$ one can take $L(r) = 2(r+1)$.

• Finally, the $L_2$-norm regularized $\phi$-risk is, for any $f \in \mathcal{M}$:

$$R_{\phi,0}(f) := E_P\left[ \phi(Y f(X)) \right] + \lambda \| f \|_{L_2}^2,$$

where

$$\| f \|_{L_2}^2 := \int_{\mathbb{R}^d} f(x)^2 \, dx \in [0,+\infty],$$

and the minimum achievable $R_{\phi,0}$-risk over $\mathcal{M}$ is denoted by

$$R^*_{\phi,0} := \inf_{f \in \mathcal{M}} R_{\phi,0}(f) < \infty.$$

As we shall see in the sequel, the above notation is consistent with the fact that $R_{\phi,0}$ is
the pointwise limit of $R_{\phi,\sigma}$ as $\sigma$ tends to zero.


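Each of these risks has an empirical counterpart where the expectation with respect to $P$ is
replaced by an average over an i.i.d. sample $T := \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$. In particular,
the following empirical version of $R_{\phi,\sigma}$ will be used:

$$\forall \sigma > 0,\ f \in \mathcal{H}_\sigma, \quad \widehat{R}_{\phi,\sigma}(f) := \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i f(X_i)\right) + \lambda \| f \|_{\mathcal{H}_\sigma}^2.$$

Furthermore, $\hat{f}_{\phi,\sigma}$ denotes the minimizer of $\widehat{R}_{\phi,\sigma}$ over
$\mathcal{H}_\sigma$ (see Steinwart, 2005a, for a proof of existence and uniqueness of
$\hat{f}_{\phi,\sigma}$).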

The main focus of this paper is the analysis of learning algorithms that minimize the empirical
$\phi$-risk regularized by the RKHS norm $\widehat{R}_{\phi,\sigma}$ and their limit as the number
of points tends to infinity and the kernel width $\sigma$ decreases to 0 at a suitable rate when
$n$ tends to $\infty$, $\lambda$ being kept fixed. Roughly speaking, our main result shows that in
this situation, the minimization of $\widehat{R}_{\phi,\sigma}$ asymptotically amounts to
minimizing $R_{\phi,0}$. This stems from the fact that the empirical average term in the definition
of $\widehat{R}_{\phi,\sigma}$ converges to its corresponding expectation, while the norm in
$\mathcal{H}_\sigma$ of a function $f$ decreases to its $L_2$ norm when $\sigma$ decreases to zero.
To turn this intuition into a rigorous statement, we need a few more assumptions about the
minimizer of $R_{\phi,0}$ and about $P$. First, we observe that the minimizer of $R_{\phi,0}$ is
indeed well-defined and can often be explicitly computed (the following lemma is part of
Theorem 26 and is proved in Section 6.3):

Lemma 1 (Minimizer of $R_{\phi,0}$) For any $x \in \mathbb{R}^d$, let

$$f_{\phi,0}(x) := \arg\min_{\alpha \in \mathbb{R}} \left\{ \rho(x) \left[ \eta(x)\phi(\alpha) + (1-\eta(x))\phi(-\alpha) \right] + \lambda\alpha^2 \right\}.$$

Then $f_{\phi,0}$ is measurable and satisfies:

$$R_{\phi,0}(f_{\phi,0}) = \inf_{f \in \mathcal{M}} R_{\phi,0}(f).$$
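To make the pointwise minimization of Lemma 1 concrete, here is a minimal Python sketch. It assumes the pointwise objective reads $\rho(x)[\eta(x)\phi(\alpha) + (1-\eta(x))\phi(-\alpha)] + \lambda\alpha^2$ and that $\phi \geq 0$; the helper names and the ternary search are our own illustrative choices, not the authors' procedure. For the hinge loss with only positive examples ($\eta \equiv 1$), setting the derivative to zero gives the closed form $\min(\rho/(2\lambda), 1)$, which the numerical search reproduces.

```python
import math


def hinge(u: float) -> float:
    """Hinge loss phi(u) = max(1 - u, 0)."""
    return max(1.0 - u, 0.0)


def pointwise_objective(alpha, rho, eta, lam, phi):
    """rho * [eta * phi(alpha) + (1 - eta) * phi(-alpha)] + lam * alpha**2."""
    return rho * (eta * phi(alpha) + (1.0 - eta) * phi(-alpha)) + lam * alpha ** 2


def pointwise_minimizer(rho, eta, lam, phi, iters=200):
    """Ternary search for the minimizer; valid because the objective is strictly
    convex in alpha when phi is convex and lam > 0 (assumes phi >= 0)."""
    # Any minimizer satisfies lam * alpha**2 <= objective(0) = rho * phi(0).
    radius = math.sqrt(rho * phi(0.0) / lam) + 1.0
    lo, hi = -radius, radius
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if pointwise_objective(m1, rho, eta, lam, phi) < pointwise_objective(m2, rho, eta, lam, phi):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def one_class_closed_form(rho, lam):
    """Closed form for hinge loss and eta = 1: the density truncated at 2 * lam."""
    return min(rho / (2.0 * lam), 1.0)


if __name__ == "__main__":
    for rho in (0.3, 2.0):
        print(rho, pointwise_minimizer(rho, 1.0, 0.5, hinge), one_class_closed_form(rho, 0.5))
```

In low-density regions the minimizer is proportional to the density, and it saturates at 1 where the density exceeds $2\lambda$.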
Second, let us recall the notion of modulus of continuity (DeVore and Lorentz, 1993, p.44):


Definition 2 (Modulus of Continuity) Let $f$ be a Lebesgue measurable function from
$\mathbb{R}^d$ to $\mathbb{R}$. Then its modulus of continuity in the $L_1$-norm is defined for
any $\delta \geq 0$ as follows:

$$\omega(f,\delta) := \sup_{0 \leq \| t \| \leq \delta} \| f(.+t) - f(.) \|_{L_1}, \tag{4}$$

where $\| t \|$ is the Euclidean norm of $t \in \mathbb{R}^d$.


The main result of this paper, whose proof is postponed to Section 4, can now be stated as follows: 


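Theorem 3 (Main Result) Let $\sigma_1 > \sigma > 0$, $0 < p < 2$, $\delta > 0$, and let
$\hat{f}_{\phi,\sigma}$ denote the minimizer of the $\widehat{R}_{\phi,\sigma}$ risk over
$\mathcal{H}_\sigma$, where $\phi$ is assumed to be convex. Assume that the marginal density
$\rho$ is bounded, and let $M := \sup_{x \in \mathbb{R}^d} \rho(x)$. Then there exist constants
$(K_i)_{i=1\ldots4}$ (depending only on $p$, $\delta$, $\lambda$, $d$, and $M$) such that the
following holds with probability greater than $1 - e^{-x}$ over the draw of the training data:

$$\begin{aligned}
R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,0} \leq{} & K_1 \, L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{\frac{4}{2+p}} \left( \frac{1}{\sigma} \right)^{\frac{[2+(2-p)(1+\delta)]d}{2+p}} \left( \frac{1}{n} \right)^{\frac{2}{2+p}} \\
& + K_2 \, L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{2} \left( \frac{1}{\sigma} \right)^{d} \frac{x}{n} \\
& + K_3 \frac{\sigma^2}{\sigma_1^2} \\
& + K_4 \, \omega(f_{\phi,0}, \sigma_1),
\end{aligned} \tag{5}$$

where $L(r)$ still denotes the Lipschitz constant of $\phi$ on the interval $[-r,r]$, for any
$r > 0$.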


The first two terms in the r.h.s. of (5) bound the estimation error (also called sample error)
associated with the Gaussian RKHS, which naturally tends to be small when the number of training
data increases and when the RKHS is 'small', i.e., when $\sigma$ is large. As is usually the case
in such variance/bias splittings, the variance term here depends on the dimension $d$ of the input
space. Note that it is also parametrized by both $p$ and $\delta$. These two parameters come from
the bound (36) on covering numbers that is used to derive the estimation error bound (31). Both
constants $K_1$ and $K_2$ depend on them, although we do not know the explicit dependency. The
third term measures the error due to penalizing the $L_2$-norm of a fixed function in
$\mathcal{H}_{\sigma_1}$ by its $\| . \|_{\mathcal{H}_\sigma}$-norm, with $0 < \sigma < \sigma_1$.
This is a price to pay to get a small estimation error. As for the fourth term, it is a bound on
the approximation error of the Gaussian RKHS. Note that, once $\lambda$ and $\sigma$ have been
fixed, $\sigma_1$ remains a free variable parameterizing the bound itself.


From (5), we can deduce the $R_{\phi,0}$-consistency of $\hat{f}_{\phi,\sigma}$ for Lipschitz loss
functions, as soon as $f_{\phi,0}$ is integrable, $\sigma = o(n^{-1/(d+\varepsilon)})$ for some
$\varepsilon > 0$, and $\sigma_1 \to 0$ with $\sigma/\sigma_1 \to 0$. Now, in order to highlight
the type of convergence rates one can obtain from Theorem 3, let us assume that the loss function
is Lipschitz on $\mathbb{R}$ (e.g., take the hinge loss), and suppose that for some
$0 \leq \beta \leq 1$, $c_1 > 0$, and for any $h \geq 0$, $f_{\phi,0}$ satisfies the following
inequality:³

$$\omega(f_{\phi,0}, h) \leq c_1 h^\beta. \tag{6}$$

Then the right-hand side of (5) can be optimized w.r.t. $\sigma_1$, $\sigma$, $p$ and $\delta$ by
balancing the first, third and fourth terms (the second term having always a better convergence
rate than the first one). For any $\varepsilon > 0$, by choosing:

$$\delta = 1, \qquad p = 2 - \frac{\varepsilon}{2d + d\beta - \beta},$$

$$\sigma = \left( \frac{1}{n} \right)^{\frac{2+\beta}{4\beta + (2+\beta)d + \varepsilon}}, \qquad \sigma_1 = \sigma^{\frac{2}{2+\beta}} = \left( \frac{1}{n} \right)^{\frac{2}{4\beta + (2+\beta)d + \varepsilon}},$$

the following rate of convergence is obtained:

$$R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,0} = O_P\left( \left( \frac{1}{n} \right)^{\frac{2\beta}{4\beta + (2+\beta)d + \varepsilon}} \right).$$

This shows in particular that, whatever the values of $\beta$ and $d$, the convergence rate that
can be derived from Theorem 3 is always slower than $1/\sqrt{n}$, and it gets slower and slower as
the dimension $d$ increases.
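The dependence of this rate on $\beta$ and $d$ can be tabulated with a short Python sketch. It assumes the bandwidth choices read $\sigma = n^{-(2+\beta)/(4\beta+(2+\beta)d+\varepsilon)}$ and $\sigma_1 = \sigma^{2/(2+\beta)}$, so that the third term $\sigma^2/\sigma_1^2$ and the fourth term $\sigma_1^\beta$ of (5) balance; the function name is ours and the constants $K_i$ and $c_1$ are ignored.

```python
def bandwidths_and_rate(n: int, beta: float, d: int, eps: float):
    """Return (sigma, sigma_1, rate_exponent) for the balanced choice of bandwidths.

    rate_exponent is the power r such that the excess risk is O_P(n**(-r)).
    """
    denom = 4.0 * beta + (2.0 + beta) * d + eps
    sigma = float(n) ** (-(2.0 + beta) / denom)
    sigma_1 = sigma ** (2.0 / (2.0 + beta))
    rate_exponent = 2.0 * beta / denom
    return sigma, sigma_1, rate_exponent


if __name__ == "__main__":
    for d in (1, 2, 5, 10):
        sigma, sigma_1, r = bandwidths_and_rate(10 ** 6, beta=1.0, d=d, eps=0.1)
        # r stays below 1/2 and shrinks as the dimension grows.
        print(d, sigma, sigma_1, r)
```

Since $\sigma < 1$ and $2/(2+\beta) < 1$, this choice always gives $\sigma < \sigma_1$, as Theorem 3 requires.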

Theorem 3 shows that, when $\phi$ is convex, minimizing the $\widehat{R}_{\phi,\sigma}$ risk for a
well-chosen width $\sigma$ is an algorithm consistent for the $R_{\phi,0}$-risk. In order to relate
this consistency with more traditional measures of performance of learning algorithms, the next
theorem shows that under a simple additional condition on $\phi$, $R_{\phi,0}$-risk-consistency
implies Bayes consistency:


Theorem 4 (Relating $R_{\phi,0}$-Consistency with Bayes Consistency) If $\phi$ is convex,
differentiable at 0, with $\phi'(0) < 0$, then for every sequence of functions
$(f_i)_{i \geq 1} \in \mathcal{M}$,

$$\lim_{i \to +\infty} R_{\phi,0}(f_i) = R^*_{\phi,0} \implies \lim_{i \to +\infty} R(f_i) = R^*.$$

This theorem results from a more general quantitative analysis of the relationship between the
excess $R_{\phi,0}$-risk and the excess $R$-risk (Theorem 28), and is proved in Section 6.5. In
order to state a refined version of it in the particular case of the support vector machine
algorithm, we first need to introduce the notion of low density exponent:


Definition 5 We say that a distribution $P$ with marginal density $\rho$ w.r.t. Lebesgue measure
has a low density exponent $\gamma \geq 0$ if there exists
$(c_2, \varepsilon_0) \in (0,+\infty)^2$ such that

$$\forall \varepsilon \in [0,\varepsilon_0], \quad P\left( \left\{ x \in \mathbb{R}^d : \rho(x) \leq \varepsilon \right\} \right) \leq c_2 \varepsilon^\gamma.$$





3. For instance, it can be shown that the indicator function of the unit ball in $\mathbb{R}^d$,
albeit not continuous, satisfies (6) with $\beta = 1$.
We are now in position to state a quantitative relationship between the excess $R_{\phi,0}$-risk
and the excess $R$-risk in the case of support vector machines:


Theorem 6 (Consistency of SVM) Let $\phi_1(\alpha) := \max(1-\alpha, 0)$ be the hinge loss
function and let $\phi_2(\alpha) := \max(1-\alpha, 0)^2$ be the squared hinge loss function. Then
for any distribution $P$ with low density exponent $\gamma$, there exist constants
$(K_1, K_2, r_1, r_2) \in (0,+\infty)^4$ such that for any $f \in \mathcal{M}$ with an excess
$R_{\phi_1,0}$-risk upper bounded by $r_1$ the following holds:

$$R(f) - R^* \leq K_1 \left( R_{\phi_1,0}(f) - R^*_{\phi_1,0} \right)^{\frac{\gamma}{2\gamma+1}},$$

and if the excess regularized $R_{\phi_2,0}$-risk is upper bounded by $r_2$ the following holds:

$$R(f) - R^* \leq K_2 \left( R_{\phi_2,0}(f) - R^*_{\phi_2,0} \right)^{\frac{\gamma}{2\gamma+1}}.$$


This theorem is proved in Section 7.3. In combination with Theorem 3, it states the consistency of 
SVM, and gives upper bounds on the convergence rates, for the first time in a situation where the 
effect of regularization does not vanish asymptotically. In fact, Theorem 6 is a particular case of 
a more general result (Theorem 29) valid for a large class of convex loss functions. Section 6 is 
devoted to the analysis of the general case through the introduction of variational arguments, in the 
spirit of Bartlett et al. (2006). 

Another consequence of the $R_{\phi,0}$-consistency of an algorithm is the $L_2$ convergence of the
function output by the algorithm to the minimizer of the $R_{\phi,0}$-risk:


Lemma 7 (Relating $R_{\phi,0}$-Consistency with $L_2$-Consistency) For any $f \in \mathcal{M}$,
the following holds:

$$\| f - f_{\phi,0} \|_{L_2}^2 \leq \frac{1}{\lambda} \left( R_{\phi,0}(f) - R^*_{\phi,0} \right).$$


This result is the third statement of Theorem 26, proved in Section 6.3. It is particularly relevant 
to study algorithms whose objective is not binary classification. Consider for example the one- 
class SVM algorithm, which served as the initial motivation for this paper. Then we can state the 
following result, proved in Section 8.1 : 


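Theorem 8 ($L_2$-Consistency of One-Class SVM) Let $\rho_\lambda$ denote the function obtained
after truncating the density:

$$\rho_\lambda(x) := \begin{cases} \dfrac{\rho(x)}{2\lambda} & \text{if } \rho(x) \leq 2\lambda, \\ 1 & \text{otherwise.} \end{cases} \tag{7}$$

Let $\hat{f}_\sigma$ denote the function output by the one-class SVM:

$$\hat{f}_\sigma := \arg\min_{f \in \mathcal{H}_\sigma} \frac{1}{n} \sum_{i=1}^{n} \phi(f(X_i)) + \lambda \| f \|_{\mathcal{H}_\sigma}^2.$$

Then, under the general conditions of Theorem 3, and assuming that
$\lim_{h \to 0} \omega(\rho_\lambda, h) = 0$,

$$\lim_{n \to +\infty} \| \hat{f}_\sigma - \rho_\lambda \|_{L_2} = 0, \quad \text{in probability},$$

for a well-calibrated bandwidth $\sigma$.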
In this and the next theorem, well-calibrated refers to any choice of bandwidth $\sigma$ that
ensures $R_{\phi,0}$-consistency, as discussed after Theorem 3. A very interesting by-product of
this theorem is the consistency of the one-class SVM algorithm for density level set estimation,
which to the best of our knowledge has not been stated so far (the proof being postponed to
Section 8.2):


Theorem 9 (Consistency of One-Class SVM for Density Level Set Estimation) Let
$0 < \mu < 2\lambda < M$, let $C_\mu$ be the level set of the density function $\rho$ at level
$\mu$:

$$C_\mu := \left\{ x \in \mathbb{R}^d : \rho(x) \geq \mu \right\},$$

and $\widehat{C}_\mu$ be the level set of $2\lambda \hat{f}_\sigma$ at level $\mu$:

$$\widehat{C}_\mu := \left\{ x \in \mathbb{R}^d : \hat{f}_\sigma(x) \geq \frac{\mu}{2\lambda} \right\},$$

where $\hat{f}_\sigma$ is still the function output by the one-class SVM. For any distribution $Q$,
for any subset $C$ of $\mathbb{R}^d$, define the excess-mass of $C$ with respect to $Q$ as follows:

$$H_Q(C) := Q(C) - \mu \, \mathrm{Leb}(C).$$

Then, under the general assumptions of Theorem 3, and assuming that
$\lim_{h \to 0} \omega(\rho_\lambda, h) = 0$, we have

$$\lim_{n \to +\infty} H_P(C_\mu) - H_P(\widehat{C}_\mu) = 0, \quad \text{in probability},$$

for a well-calibrated bandwidth $\sigma$.


The excess-mass functional was first introduced by Hartigan (1987) to assess the quality of
density level set estimators. It is maximized by the true density level set $C_\mu$ and acts as a
risk functional in the one-class framework. The proof of Theorem 9 is based on the following
general result: if $\hat{\rho}$ is a density estimator converging to the true density $\rho$ in
the $L_2$ sense, then for any fixed $0 < \mu < \sup_{\mathbb{R}^d}\{\rho\}$, the excess mass of
the level set of $\hat{\rho}$ at level $\mu$ converges to the excess mass of $C_\mu$. In other
words, as is the case in the classification framework, plug-in rules built on $L_2$-consistent
density estimators are consistent with respect to the excess mass.
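The fact that the excess mass is maximized by the density level set can be checked by hand on a toy example. The Python sketch below is an illustration under our own assumptions, not part of the original analysis: it uses the one-dimensional triangular density $\rho(x) = \max(1-|x|, 0)$, restricts attention to symmetric intervals $[-a,a]$, and confirms that the excess mass $P([-a,a]) - 2\mu a$ peaks at $a = 1-\mu$, the half-width of the level set $\{\rho \geq \mu\}$.

```python
def triangular_mass(a: float) -> float:
    """P([-a, a]) for the triangular density rho(x) = max(1 - |x|, 0)."""
    a = min(max(a, 0.0), 1.0)
    return 2.0 * a - a * a


def excess_mass(a: float, mu: float) -> float:
    """H_P([-a, a]) = P([-a, a]) - mu * Leb([-a, a])."""
    return triangular_mass(a) - mu * 2.0 * a


def best_half_width(mu: float, grid: int = 1500) -> float:
    """Grid search over half-widths a in [0, 1.5] with step 0.001."""
    candidates = [i / 1000.0 for i in range(grid + 1)]
    return max(candidates, key=lambda a: excess_mass(a, mu))


if __name__ == "__main__":
    mu = 0.4
    a_star = best_half_width(mu)
    # The level set {rho >= mu} is [-(1 - mu), 1 - mu], with excess mass (1 - mu)**2.
    print(a_star, excess_mass(a_star, mu), (1.0 - mu) ** 2)
```

Intervals that are too wide pay more in volume than they gain in probability, and intervals that are too narrow leave out points where the density still exceeds $\mu$.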


3. Some Properties of the Gaussian Kernel and its RKHS 


This section presents known and new results about the Gaussian kernel $k_\sigma$ and its
associated RKHS $\mathcal{H}_\sigma$ that are useful for the proofs of our results. They concern
the explicit description of the RKHS norm in terms of Fourier transforms, its relation with the
$L_2$-norm, and some approximation properties of convolutions with the Gaussian kernel. They make
use of basic properties of Fourier transforms which we now recall (and which can be found in e.g.
Folland, 1992, Chap. 7, p.205).

For any $f$ in $L_1(\mathbb{R}^d)$, its Fourier transform
$\mathcal{F}[f] : \mathbb{R}^d \to \mathbb{R}$ is defined by

$$\mathcal{F}[f](\omega) := \int_{\mathbb{R}^d} e^{-i \langle x, \omega \rangle} f(x) \, dx.$$

If in addition $\mathcal{F}[f] \in L_1(\mathbb{R}^d)$, $f$ can be recovered from $\mathcal{F}[f]$
by the inverse Fourier formula:

$$f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \mathcal{F}[f](\omega) \, e^{i \langle x, \omega \rangle} \, d\omega.$$

Finally Parseval's equality relates the $L_2$-norm of a function and its Fourier transform if
$f \in L_1(\mathbb{R}^d) \cap L_2(\mathbb{R}^d)$ and $\mathcal{F}[f] \in L_2(\mathbb{R}^d)$:

$$\| f \|_{L_2}^2 = \frac{1}{(2\pi)^d} \| \mathcal{F}[f] \|_{L_2}^2. \tag{8}$$


3.1 Fourier Representation of the Gaussian RKHS 
For any $u \in \mathbb{R}^d$, the expression $k_\sigma(u)$ denotes $k_\sigma(0,u)$, with Fourier
transform known to be:

$$\mathcal{F}[k_\sigma](\omega) = e^{-\frac{\sigma^2 \| \omega \|^2}{2}}. \tag{9}$$


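The general study of translation invariant kernels provides an accurate characterization of their
associated RKHS in terms of their Fourier transform (see, e.g., Matache and Matache, 2002). In
the case of the Gaussian kernel, the following holds:


Lemma 10 (Characterization of $\mathcal{H}_\sigma$) Let $C_0(\mathbb{R}^d)$ denote the set of
continuous functions on $\mathbb{R}^d$ that vanish at infinity. The set

$$\mathcal{H}_\sigma := \left\{ f \in C_0(\mathbb{R}^d) : f \in L_1(\mathbb{R}^d) \text{ and } \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 e^{\frac{\sigma^2 \| \omega \|^2}{2}} \, d\omega < \infty \right\} \tag{10}$$

is the RKHS associated with the Gaussian kernel $k_\sigma$, and the associated dot product is
given for any $f, g \in \mathcal{H}_\sigma$ by

$$\langle f, g \rangle_{\mathcal{H}_\sigma} = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \mathcal{F}[f](\omega) \, \mathcal{F}[g](\omega)^* \, e^{\frac{\sigma^2 \| \omega \|^2}{2}} \, d\omega, \tag{11}$$

where $a^*$ denotes the conjugate of a complex number $a$. In particular the associated norm is
given for any $f \in \mathcal{H}_\sigma$ by

$$\| f \|_{\mathcal{H}_\sigma}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 e^{\frac{\sigma^2 \| \omega \|^2}{2}} \, d\omega. \tag{12}$$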





This lemma readily implies several basic facts about Gaussian RKHS and their associated norms
summarized in the next lemma. In particular, it shows that the family
$(\mathcal{H}_\sigma)_{\sigma > 0}$ forms a nested collection of models, and that for any fixed
function, the RKHS norm decreases to the $L_2$-norm as the kernel bandwidth decreases to 0.


Lemma 11 The following statements hold: 


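1. For any $0 < \sigma < \tau$,

$$\mathcal{H}_\tau \subset \mathcal{H}_\sigma \subset L_2(\mathbb{R}^d). \tag{13}$$

Moreover, for any $f \in \mathcal{H}_\tau$,

$$\| f \|_{\mathcal{H}_\tau} \geq \| f \|_{\mathcal{H}_\sigma} \geq \| f \|_{L_2} \tag{14}$$

and

$$0 \leq \| f \|_{\mathcal{H}_\sigma}^2 - \| f \|_{L_2}^2 \leq \frac{\sigma^2}{\tau^2} \left( \| f \|_{\mathcal{H}_\tau}^2 - \| f \|_{L_2}^2 \right). \tag{15}$$

2. For any $\tau > 0$ and $f \in \mathcal{H}_\tau$,

$$\lim_{\sigma \to 0} \| f \|_{\mathcal{H}_\sigma} = \| f \|_{L_2}. \tag{16}$$

3. For any $\sigma > 0$ and $f \in \mathcal{H}_\sigma$,

$$\| f \|_{L_\infty} \leq \sqrt{\kappa_\sigma} \, \| f \|_{\mathcal{H}_\sigma}. \tag{17}$$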


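Proof Equations 13 and 14 are direct consequences of the characterization of the Gaussian RKHS (12)
and of the observation that

$$0 < \sigma < \tau \implies e^{\frac{\tau^2 \| \omega \|^2}{2}} \geq e^{\frac{\sigma^2 \| \omega \|^2}{2}} \geq 1.$$

In order to prove (15), we derive from (12) and Parseval's equality (8):

$$\| f \|_{\mathcal{H}_\sigma}^2 - \| f \|_{L_2}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 \left[ e^{\frac{\sigma^2 \| \omega \|^2}{2}} - 1 \right] d\omega. \tag{18}$$

For any $0 < u \leq v$, we have $(e^u - 1)/u \leq (e^v - 1)/v$ by convexity of $e^u$, and
therefore:

$$\| f \|_{\mathcal{H}_\sigma}^2 - \| f \|_{L_2}^2 \leq \frac{\sigma^2}{\tau^2} \, \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 \left[ e^{\frac{\tau^2 \| \omega \|^2}{2}} - 1 \right] d\omega, \tag{19}$$

which leads to (15). Equation 16 is now a direct consequence of (15). Finally, (17) is a classical
bound derived from the observation that, for any $x \in \mathbb{R}^d$,

$$| f(x) | = \left| \langle f, k_\sigma(x,.) \rangle_{\mathcal{H}_\sigma} \right| \leq \| f \|_{\mathcal{H}_\sigma} \| k_\sigma(x,.) \|_{\mathcal{H}_\sigma} = \sqrt{\kappa_\sigma} \, \| f \|_{\mathcal{H}_\sigma}.$$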


3.2 Links with the Non-Normalized Gaussian Kernel 


It is common in the machine learning literature to work with a non-normalized version of the Gaus- 
sian RBF kernel, namely the kernel: 


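$$\tilde{k}_\sigma(x,x') := \exp\left( \frac{-\| x - x' \|^2}{2\sigma^2} \right). \tag{20}$$

From the relation $k_\sigma = \kappa_\sigma \tilde{k}_\sigma$ (remember that $\kappa_\sigma$ is
defined in Equation 3), we deduce from the general theory of RKHS that
$\tilde{\mathcal{H}}_\sigma = \mathcal{H}_\sigma$ and

$$\forall f \in \mathcal{H}_\sigma, \quad \| f \|_{\tilde{\mathcal{H}}_\sigma} = \sqrt{\kappa_\sigma} \, \| f \|_{\mathcal{H}_\sigma}. \tag{21}$$

As a result, all statements about $k_\sigma$ and its RKHS easily translate into statements about
$\tilde{k}_\sigma$ and its RKHS. For example, (14) shows that, for any $0 < \sigma < \tau$ and
$f \in \tilde{\mathcal{H}}_\tau$,

$$\| f \|_{\tilde{\mathcal{H}}_\tau} \geq \sqrt{\frac{\kappa_\tau}{\kappa_\sigma}} \, \| f \|_{\tilde{\mathcal{H}}_\sigma} = \left( \frac{\sigma}{\tau} \right)^{\frac{d}{2}} \| f \|_{\tilde{\mathcal{H}}_\sigma},$$

a result that was shown recently (Steinwart et al., 2004, Corollary 3.12).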
3.3 Convolution with the Gaussian Kernel


Besides its positive definiteness, the Gaussian kernel is commonly used as a kernel for function
approximation through convolution. Recall (Folland, 1992) that the convolution between two
functions $f, g \in L_1(\mathbb{R}^d)$ is the function $f * g \in L_1(\mathbb{R}^d)$ defined by

$$f * g(x) := \int_{\mathbb{R}^d} f(x-u) \, g(u) \, du,$$

and that it satisfies (see e.g. Folland, 1992, Chap. 7, p.207)

$$\mathcal{F}[f * g] = \mathcal{F}[f] \, \mathcal{F}[g]. \tag{22}$$


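The convolution with a Gaussian RBF kernel is a technically convenient tool to map any square
integrable function to a Gaussian RKHS. The following lemma (which can also be found in Steinwart
et al., 2004) gives several interesting properties on the RKHS and $L_2$ norms of functions
smoothed by convolution:


Lemma 12 For any $\sigma > 0$ and any $f \in L_1(\mathbb{R}^d) \cap L_2(\mathbb{R}^d)$,

$$k_\sigma * f \in \mathcal{H}_{\sqrt{2}\sigma}$$

and

$$\| k_\sigma * f \|_{\mathcal{H}_{\sqrt{2}\sigma}} = \| f \|_{L_2}. \tag{23}$$

For any $\sigma, \tau > 0$ that satisfy $0 < \sigma \leq \sqrt{2}\tau$, and for any
$f \in L_1(\mathbb{R}^d) \cap L_2(\mathbb{R}^d)$,

$$k_\tau * f \in \mathcal{H}_\sigma$$

and

$$\| k_\tau * f \|_{\mathcal{H}_\sigma}^2 - \| k_\tau * f \|_{L_2}^2 \leq \frac{\sigma^2}{2\tau^2} \| f \|_{L_2}^2. \tag{24}$$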


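Proof Using (12), then (22) and (9), followed by Parseval's equality (8), we compute:

$$\begin{aligned}
\| k_\sigma * f \|_{\mathcal{H}_{\sqrt{2}\sigma}}^2 &= \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[k_\sigma * f](\omega) \right|^2 e^{\sigma^2 \| \omega \|^2} \, d\omega \\
&= \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 e^{-\sigma^2 \| \omega \|^2} e^{\sigma^2 \| \omega \|^2} \, d\omega \\
&= \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \mathcal{F}[f](\omega) \right|^2 d\omega \\
&= \| f \|_{L_2}^2.
\end{aligned}$$

This proves the first two statements of the lemma.
Now, because $0 < \sigma \leq \sqrt{2}\tau$, the first statement and (13) imply

$$k_\tau * f \in \mathcal{H}_{\sqrt{2}\tau} \subset \mathcal{H}_\sigma,$$

and, using (15) and (23),

$$\begin{aligned}
\| k_\tau * f \|_{\mathcal{H}_\sigma}^2 - \| k_\tau * f \|_{L_2}^2 &\leq \frac{\sigma^2}{2\tau^2} \left( \| k_\tau * f \|_{\mathcal{H}_{\sqrt{2}\tau}}^2 - \| k_\tau * f \|_{L_2}^2 \right) \\
&\leq \frac{\sigma^2}{2\tau^2} \| k_\tau * f \|_{\mathcal{H}_{\sqrt{2}\tau}}^2 \\
&= \frac{\sigma^2}{2\tau^2} \| f \|_{L_2}^2. \qquad \square
\end{aligned}$$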
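Lemmas 11 and 12 can be sanity-checked in closed form in dimension one when $f$ is itself a Gaussian. The Python sketch below rests on our own worked assumptions rather than on anything stated by the authors: for $f = k_t$ with $d = 1$, formula (12) and (9) give $\| k_t \|_{\mathcal{H}_a}^2 = 1/(2\sqrt{\pi (t^2 - a^2/2)})$ whenever $t^2 > a^2/2$ (the case $a = 0$ being the squared $L_2$ norm), and $k_\sigma * k_s = k_{\sqrt{\sigma^2+s^2}}$ because Gaussian variances add under convolution. The function names are hypothetical.

```python
import math


def gaussian_rkhs_norm_sq(t: float, a: float) -> float:
    """Squared H_a-norm of the 1-D normalized Gaussian k_t (a = 0 gives the L2 norm).

    Obtained by integrating exp(-(t**2 - a**2 / 2) * w**2) / (2 * pi) over the real line.
    """
    gap = t * t - a * a / 2.0
    if gap <= 0.0:
        raise ValueError("k_t does not belong to H_a: need t**2 > a**2 / 2")
    return 1.0 / (2.0 * math.sqrt(math.pi * gap))


def convolved_bandwidth(sigma: float, s: float) -> float:
    """Bandwidth of k_sigma * k_s, which is again a normalized Gaussian."""
    return math.sqrt(sigma * sigma + s * s)


if __name__ == "__main__":
    sigma, tau, s = 1.0, 1.0, 1.0
    # (23): the H_{sqrt(2) sigma}-norm of k_sigma * f equals the L2 norm of f.
    lhs_23 = gaussian_rkhs_norm_sq(convolved_bandwidth(sigma, s), math.sqrt(2.0) * sigma)
    rhs_23 = gaussian_rkhs_norm_sq(s, 0.0)
    # (24): gap between the H_sigma and L2 norms of k_tau * f.
    t = convolved_bandwidth(tau, s)
    lhs_24 = gaussian_rkhs_norm_sq(t, sigma) - gaussian_rkhs_norm_sq(t, 0.0)
    rhs_24 = sigma ** 2 / (2.0 * tau ** 2) * gaussian_rkhs_norm_sq(s, 0.0)
    print(lhs_23, rhs_23, lhs_24, rhs_24)
```

In this example the bound (24) is far from tight, which is expected since the proof discards the term $\| k_\tau * f \|_{L_2}^2$.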


A final result we need is an estimate of the approximation properties of convolution with the 
Gaussian kernel. Convolving a function with a Gaussian kernel with decreasing bandwidth is known 
to provide an approximation of the original function under general conditions. For example, the
assumption $f \in L_1(\mathbb{R}^d)$ is sufficient to show that $\| k_\sigma * f - f \|_{L_1}$
goes to zero when $\sigma$ goes to zero (see, for example Devroye and Lugosi, 2000, page 79). We
provide below a more quantitative estimate for the rate of this convergence under some assumption
on the modulus of continuity of $f$ (see Definition 2), using methods from DeVore and Lorentz
(1993, Chap.7, par.2, p.202).


Lemma 13 Let $f$ be a bounded function in $L_1(\mathbb{R}^d)$. Then for all $\sigma > 0$, the
following holds:

$$\| k_\sigma * f - f \|_{L_1} \leq (1 + \sqrt{d}) \, \omega(f, \sigma),$$

where $\omega(f,.)$ denotes the modulus of continuity of $f$ in the $L_1$ norm.


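Proof Using the fact that $k_\sigma$ is normalized, then Fubini's theorem and then the definition
of $\omega$, the following can be derived:

$$\begin{aligned}
\| k_\sigma * f - f \|_{L_1} &= \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} k_\sigma(t) \left[ f(x+t) - f(x) \right] dt \right| dx \\
&\leq \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} k_\sigma(t) \left| f(x+t) - f(x) \right| dt \, dx \\
&= \int_{\mathbb{R}^d} k_\sigma(t) \left[ \int_{\mathbb{R}^d} \left| f(x+t) - f(x) \right| dx \right] dt \\
&= \int_{\mathbb{R}^d} k_\sigma(t) \, \| f(.+t) - f(.) \|_{L_1} \, dt \\
&\leq \int_{\mathbb{R}^d} k_\sigma(t) \, \omega(f, \| t \|) \, dt.
\end{aligned}$$

Now, using the following subadditivity property of $\omega$ (DeVore and Lorentz, 1993, Chap.2,
par.7, p.44):

$$\omega(f, \delta_1 + \delta_2) \leq \omega(f, \delta_1) + \omega(f, \delta_2), \quad \delta_1, \delta_2 \geq 0,$$

the following inequality can be derived for any positive $\lambda$ and $\delta$:

$$\omega(f, \lambda\delta) \leq (1 + \lambda) \, \omega(f, \delta).$$

Applying this and also Cauchy-Schwarz inequality leads to

$$\begin{aligned}
\| k_\sigma * f - f \|_{L_1} &\leq \int_{\mathbb{R}^d} \left( 1 + \frac{\| t \|}{\sigma} \right) \omega(f, \sigma) \, k_\sigma(t) \, dt \\
&= \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \int_{\mathbb{R}^d} \| t \| \, k_\sigma(t) \, dt \right] \\
&\leq \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \left( \int_{\mathbb{R}^d} \| t \|^2 \, k_\sigma(t) \, dt \right)^{\frac{1}{2}} \right] \\
&= \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \left( d \int_{\mathbb{R}} u^2 \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{u^2}{2\sigma^2}} \, du \right)^{\frac{1}{2}} \right].
\end{aligned}$$

The integral term is exactly the variance of a Gaussian random variable, namely $\sigma^2$. Hence
we end up with

$$\| k_\sigma * f - f \|_{L_1} \leq (1 + \sqrt{d}) \, \omega(f, \sigma). \qquad \square$$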


4. Proof of Theorem 3 


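The proof of Theorem 3 is based on the following decomposition of the excess $R_{\phi,0}$-risk for
the minimizer $\hat{f}_{\phi,\sigma}$ of the $\widehat{R}_{\phi,\sigma}$-risk:


Lemma 14 For any $0 < \sigma < \sqrt{2}\sigma_1$, and any sample $(X_i,Y_i)_{i=1,\ldots,n}$, the
minimizer $\hat{f}_{\phi,\sigma}$ of $\widehat{R}_{\phi,\sigma}$ satisfies:

$$\begin{aligned}
R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,0} \leq{} & \left[ R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,\sigma} \right] \\
& + \left[ R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \right] \\
& + \left[ R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R^*_{\phi,0} \right].
\end{aligned} \tag{25}$$

Proof The excess $R_{\phi,0}$-risk decomposes as follows:

$$\begin{aligned}
R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,0} ={} & \left[ R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) \right] \\
& + \left[ R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,\sigma} \right] \\
& + \left[ R^*_{\phi,\sigma} - R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) \right] \\
& + \left[ R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \right] \\
& + \left[ R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R^*_{\phi,0} \right].
\end{aligned}$$

Note that by Lemma 12,
$k_{\sigma_1} * f_{\phi,0} \in \mathcal{H}_{\sqrt{2}\sigma_1} \subset \mathcal{H}_\sigma \subset L_2(\mathbb{R}^d)$,
which justifies the introduction of $R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0})$ and
$R_{\phi,0}(k_{\sigma_1} * f_{\phi,0})$. Now, by definition of the different risks using (14), we
have

$$R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) = \lambda \left( \| \hat{f}_{\phi,\sigma} \|_{L_2}^2 - \| \hat{f}_{\phi,\sigma} \|_{\mathcal{H}_\sigma}^2 \right) \leq 0,$$

and

$$R^*_{\phi,\sigma} - R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) \leq 0. \qquad \square$$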


Hence, controlling $R_{\phi,0}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,0}$ to prove Theorem 3 boils down
to controlling each of the three terms arising in (25), which can be done as follows:


• The first term in (25) is usually referred to as the sample error or estimation error. The
control of such quantities has been the topic of much research recently, including for example
Tsybakov (1997); Mammen and Tsybakov (1999); Massart (2000); Bartlett et al. (2005); Koltchinskii
(2003); Steinwart and Scovel (2004). Using estimates of local Rademacher complexities through
covering numbers for the Gaussian RKHS due to Steinwart and Scovel (2004), we prove below the
following result:


Lemma 15 For any $\sigma > 0$ small enough, let $\hat{f}_{\phi,\sigma}$ be the minimizer of the
$\widehat{R}_{\phi,\sigma}$-risk on a sample of size $n$, where $\phi$ is a convex loss function.
For any $0 < p < 2$, $\delta > 0$, and $x \geq 1$, the following holds with probability at least
$1 - e^{-x}$ over the draw of the sample:

$$\begin{aligned}
R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) - R^*_{\phi,\sigma} \leq{} & K_1 \, L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{\frac{4}{2+p}} \left( \frac{1}{\sigma} \right)^{\frac{[2+(2-p)(1+\delta)]d}{2+p}} \left( \frac{1}{n} \right)^{\frac{2}{2+p}} \\
& + K_2 \, L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{2} \left( \frac{1}{\sigma} \right)^{d} \frac{x}{n},
\end{aligned}$$

where $K_1$ and $K_2$ are positive constants depending neither on $\sigma$, nor on $n$.











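• The second term in (25) can be upper bounded by

$$\frac{\phi(0) \, \sigma^2}{2\sigma_1^2}.$$

Indeed, using Lemma 12, and the fact that $\sigma < \sqrt{2}\sigma_1$, we have

$$\begin{aligned}
R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) &= \lambda \left[ \| k_{\sigma_1} * f_{\phi,0} \|_{\mathcal{H}_\sigma}^2 - \| k_{\sigma_1} * f_{\phi,0} \|_{L_2}^2 \right] \\
&\leq \frac{\lambda\sigma^2}{2\sigma_1^2} \| f_{\phi,0} \|_{L_2}^2.
\end{aligned}$$

Since $f_{\phi,0}$ minimizes $R_{\phi,0}$, we have $R_{\phi,0}(f_{\phi,0}) \leq R_{\phi,0}(0)$,
which leads to $\| f_{\phi,0} \|_{L_2}^2 \leq \phi(0)/\lambda$. Therefore, we have

$$R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \leq \frac{\phi(0) \, \sigma^2}{2\sigma_1^2}.$$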
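• The third term in (25) can be upper bounded by

$$\left( 2\lambda \| f_{\phi,0} \|_{L_\infty} + L\left( \| f_{\phi,0} \|_{L_\infty} \right) M \right) (1 + \sqrt{d}) \, \omega(f_{\phi,0}, \sigma_1).$$

Indeed,

$$\begin{aligned}
R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(f_{\phi,0}) ={} & \lambda \left[ \| k_{\sigma_1} * f_{\phi,0} \|_{L_2}^2 - \| f_{\phi,0} \|_{L_2}^2 \right] + \left[ E_P\left[ \phi(Y (k_{\sigma_1} * f_{\phi,0})(X)) \right] - E_P\left[ \phi(Y f_{\phi,0}(X)) \right] \right] \\
={} & \lambda \left\langle k_{\sigma_1} * f_{\phi,0} - f_{\phi,0}, \, k_{\sigma_1} * f_{\phi,0} + f_{\phi,0} \right\rangle_{L_2} + E_P\left[ \phi(Y (k_{\sigma_1} * f_{\phi,0})(X)) - \phi(Y f_{\phi,0}(X)) \right].
\end{aligned}$$

Now, since
$\| k_{\sigma_1} * f_{\phi,0} \|_{L_\infty} \leq \| f_{\phi,0} \|_{L_\infty} \| k_{\sigma_1} \|_{L_1} = \| f_{\phi,0} \|_{L_\infty}$,
then using Lemma 13, we obtain:

$$\begin{aligned}
R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(f_{\phi,0}) \leq{} & 2\lambda \| f_{\phi,0} \|_{L_\infty} \| k_{\sigma_1} * f_{\phi,0} - f_{\phi,0} \|_{L_1} \\
& + L\left( \| f_{\phi,0} \|_{L_\infty} \right) E_P\left[ \left| (k_{\sigma_1} * f_{\phi,0})(X) - f_{\phi,0}(X) \right| \right] \\
\leq{} & \left( 2\lambda \| f_{\phi,0} \|_{L_\infty} + L\left( \| f_{\phi,0} \|_{L_\infty} \right) M \right) \| k_{\sigma_1} * f_{\phi,0} - f_{\phi,0} \|_{L_1} \\
\leq{} & \left( 2\lambda \| f_{\phi,0} \|_{L_\infty} + L\left( \| f_{\phi,0} \|_{L_\infty} \right) M \right) (1 + \sqrt{d}) \, \omega(f_{\phi,0}, \sigma_1),
\end{aligned}$$

where $M := \sup_{x \in \mathbb{R}^d} \rho(x)$ is supposed to be finite.


Now, Theorem 3 is proved by plugging the last three bounds in (25). □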


5. Proof of Lemma 15 (Sample Error) 


The present section is divided into two subsections: the first one presents the proof of Lemma 15, 
and the auxiliary lemmas that are used in it are then proved in the second subsection. 


5.1 Proof of Lemma 15 


In order to upper bound the sample error, it is useful to work with a set of functions as “small” as 
possible, in a meaning made rigorous below. Although we study algorithms that work on the whole 
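RKHS $\mathcal{H}_\sigma$ a priori, let us first show that we can drastically "downsize" it.

Indeed, recall that the marginal distribution of $P$ in $X$ is assumed to have a support included
in a compact $\mathcal{X} \subset \mathbb{R}^d$. The restriction of $k_\sigma$ to $\mathcal{X}$,
denoted by $k_\sigma^{\mathcal{X}}$, is a positive definite kernel on $\mathcal{X}$ (Aronszajn,
1950) with RKHS defined by:

$$\mathcal{H}_\sigma^{\mathcal{X}} := \left\{ f_{|\mathcal{X}} : f \in \mathcal{H}_\sigma \right\}, \tag{26}$$

where $f_{|\mathcal{X}}$ denotes the restriction of $f$ to $\mathcal{X}$, and RKHS norm:

$$\forall f^{\mathcal{X}} \in \mathcal{H}_\sigma^{\mathcal{X}}, \quad \| f^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} := \inf \left\{ \| f \|_{\mathcal{H}_\sigma} : f \in \mathcal{H}_\sigma \text{ and } f_{|\mathcal{X}} = f^{\mathcal{X}} \right\}. \tag{27}$$

For any $f^{\mathcal{X}} \in \mathcal{H}_\sigma^{\mathcal{X}}$ consider the following risks:

$$R_{\phi,\sigma}^{\mathcal{X}}(f^{\mathcal{X}}) := E_{P_{|\mathcal{X}}}\left[ \phi(Y f^{\mathcal{X}}(X)) \right] + \lambda \| f^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}}^2,$$

$$\widehat{R}_{\phi,\sigma}^{\mathcal{X}}(f^{\mathcal{X}}) := \frac{1}{n} \sum_{i=1}^{n} \phi\left( Y_i f^{\mathcal{X}}(X_i) \right) + \lambda \| f^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}}^2.$$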
i=1 














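We first show that the sample error is the same in $\mathcal{H}_\sigma$ and
$\mathcal{H}_\sigma^{\mathcal{X}}$:

Lemma 16 Let $f_{\phi,\sigma}^{\mathcal{X}}$ and $\hat{f}_{\phi,\sigma}^{\mathcal{X}}$ be
respectively the minimizers of $R_{\phi,\sigma}^{\mathcal{X}}$ and
$\widehat{R}_{\phi,\sigma}^{\mathcal{X}}$. Then it holds almost surely that

$$R_{\phi,\sigma}(f_{\phi,\sigma}) = R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}), \qquad R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) = R_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}}).$$

From Lemma 16 we deduce that a.s.,

$$R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) - R_{\phi,\sigma}(f_{\phi,\sigma}) = R_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}}) - R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}). \tag{28}$$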


In order to upper bound this term, we use concentration inequalities based on local Rademacher 
complexities (Bartlett et al., 2006, 2005; Steinwart and Scovel, 2004). In this approach, a crucial role 
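is played by the covering number of a functional class $\mathcal{F}$ under the empirical
$L_2$-norm. Remember that for a given sample $T := \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ and
$\varepsilon > 0$, an $\varepsilon$-cover for the empirical $L_2$ norm is a family of functions
$(f_i)_{i \in I}$ such that:

$$\forall f \in \mathcal{F}, \ \exists i \in I, \quad \left( \frac{1}{n} \sum_{j=1}^{n} \left( f(X_j) - f_i(X_j) \right)^2 \right)^{\frac{1}{2}} \leq \varepsilon.$$

The covering number $\mathcal{N}(\mathcal{F}, \varepsilon, L_2(T))$ is then defined as the
smallest cardinal of an $\varepsilon$-cover.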
We can now mention the following result, adapted to our notation and setting, that exactly fits 
our need. 


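Theorem 17 (see Steinwart and Scovel, 2004, Theorem 5.8.) For $\sigma > 0$, let $\mathcal{F}_\sigma$ be a convex subset
of $\mathcal{H}_\sigma^{\mathcal{X}}$ and let $\phi$ be a convex loss function. Define $\mathcal{G}_\sigma$ as follows:

$$\mathcal{G}_\sigma := \left\{ g_f(x,y) = \phi(y f(x)) + \lambda \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} - \phi\left(y f_{\phi,\sigma}^{\mathcal{X}}(x)\right) - \lambda \| f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} : f \in \mathcal{F}_\sigma \right\}, \qquad (29)$$

where $f_{\phi,\sigma}^{\mathcal{X}}$ minimizes $R_{\phi,\sigma}^{\mathcal{X}}$ over $\mathcal{F}_\sigma$. Suppose that there are constants $c \ge 0$ and $B > 0$ such that, for
all $g \in \mathcal{G}_\sigma$,

$$E_P\left[g^2\right] \le c\, E_P\left[g\right],$$

and

$$\| g \|_{L_\infty} \le B.$$

Furthermore, assume that there are constants $a \ge 1$ and $0 < p < 2$ with

$$\sup_{T \in Z^n} \log \mathcal{N}\left(B^{-1}\mathcal{G}_\sigma, \varepsilon, L_2(T)\right) \le a \varepsilon^{-p} \qquad (30)$$

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on p such that for all $n \ge 1$ and
all $x \ge 1$ we have

$$\Pr{}^{*}\left( T \in Z^n : R_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}}) > R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}) + c_p\, \varepsilon(n,a,B,c,p,x) \right) \le e^{-x}, \qquad (31)$$

where

$$\varepsilon(n,a,B,c,p,x) = \left( B + B^{\frac{2p}{2+p}} c^{\frac{2-p}{2+p}} \right) \left( \frac{a}{n} \right)^{\frac{2}{2+p}} + (B + c)\frac{x}{n}.$$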
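To get a feel for how the rate term of Theorem 17 behaves, the quantity ε(n, a, B, c, p, x) can be evaluated directly. The sketch below assumes the formula reads (B + B^{2p/(2+p)} c^{(2-p)/(2+p)}) (a/n)^{2/(2+p)} + (B + c) x/n; the function name `epsilon_bound` is illustrative and not from the paper.

```python
def epsilon_bound(n, a, B, c, p, x):
    """Evaluate epsilon(n, a, B, c, p, x) from Theorem 17 (requires 0 < p < 2)."""
    if not 0.0 < p < 2.0:
        raise ValueError("p must lie in (0, 2)")
    # Entropy-driven term, decaying like n^(-2 / (2 + p)).
    entropy_term = (B + B ** (2.0 * p / (2.0 + p)) * c ** ((2.0 - p) / (2.0 + p))) \
        * (a / n) ** (2.0 / (2.0 + p))
    # Deviation term, decaying like 1 / n for a fixed confidence parameter x.
    deviation_term = (B + c) * x / n
    return entropy_term + deviation_term
```

For fixed a, B, c and x, the first term dominates for large n, so a smaller entropy exponent p yields a faster rate.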
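Note that we use the outer probability $\Pr^{*}$ in (31) because the argument is not necessarily measurable.
From the inequalities $\| f_{\phi,\sigma}^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \sqrt{\phi(0)/\lambda}$ and $\| \hat{f}_{\phi,\sigma}^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \sqrt{\phi(0)/\lambda}$, we see that it is enough
to take

$$\mathcal{F}_\sigma = \sqrt{\frac{\phi(0)}{\lambda}}\, \mathcal{B}_\sigma^{\mathcal{X}}, \qquad (32)$$

where $\mathcal{B}_\sigma^{\mathcal{X}}$ is the unit ball of $\mathcal{H}_\sigma^{\mathcal{X}}$, to derive a control of (28) from Theorem 17. In order to apply this
theorem we now provide uniform upper bounds over $\mathcal{G}_\sigma$ for the variance of g and its uniform norm,
as well as an upper bound on the covering number of $\mathcal{G}_\sigma$.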


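Lemma 18 For all $\sigma > 0$, for all $g \in \mathcal{G}_\sigma$,

$$E_P\left[g^2\right] \le 2 \left( L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\frac{\kappa_\sigma}{\lambda}} + 2\sqrt{\phi(0)} \right)^2 E_P\left[g\right]. \qquad (33)$$

Let us fix

$$B = 2 L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} + \phi(0). \qquad (34)$$

Then, the following two lemmas can be derived:

Lemma 19 For all $\sigma > 0$, for all $g \in \mathcal{G}_\sigma$,

$$\| g \|_{L_\infty} \le B. \qquad (35)$$

Lemma 20 For all $\sigma > 0$, $0 < p \le 2$, $\delta > 0$, $\varepsilon > 0$, the following holds:

$$\log \mathcal{N}\left(B^{-1}\mathcal{G}_\sigma, \varepsilon, L_2(T)\right) \le c_2\, \sigma^{-((1-p/2)(1+\delta))d}\, \varepsilon^{-p}, \qquad (36)$$

where $c_1$ and $c_2$ are constants that depend neither on $\sigma$, nor on $\varepsilon$ (but they depend on p, $\delta$, d and $\lambda$).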


Combining now the results of Lemmas 18, 19 and 20 allows us to apply Theorem 17 with $\mathcal{F}_\sigma$ defined
by (32), any $p \in (0,2)$, and the following parameters:

$$c = 2 \left( L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\frac{\kappa_\sigma}{\lambda}} + 2\sqrt{\phi(0)} \right)^2,$$

$$a = c_2\, \sigma^{-((1-p/2)(1+\delta))d},$$

from which we deduce Lemma 15. ∎

5.2 Proofs of Auxiliary Lemmas


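Proof of Lemma 16 Because the support of P is included in $\mathcal{X}$, the following trivial equality holds:

$$\forall f \in \mathcal{H}_\sigma, \quad E_P\left[\phi(Y f(X))\right] = E_{P_{|\mathcal{X}}}\left[\phi\left(Y f_{|\mathcal{X}}(X)\right)\right]. \qquad (37)$$

Using first the definition of the restricted RKHS (26), then (37) and (27), we obtain

$$\begin{aligned}
R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}) &= \inf_{f^{\mathcal{X}} \in \mathcal{H}_\sigma^{\mathcal{X}}} E_{P_{|\mathcal{X}}}\left[\phi\left(Y f^{\mathcal{X}}(X)\right)\right] + \lambda \| f^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \\
&= \inf_{f \in \mathcal{H}_\sigma} E_{P_{|\mathcal{X}}}\left[\phi\left(Y f_{|\mathcal{X}}(X)\right)\right] + \lambda \| f_{|\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \\
&= \inf_{f \in \mathcal{H}_\sigma} E_P\left[\phi(Y f(X))\right] + \lambda \| f \|^2_{\mathcal{H}_\sigma} \\
&= R_{\phi,\sigma}(f_{\phi,\sigma}),
\end{aligned} \qquad (38)$$

which proves the first statement.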
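In order to prove the second statement, let us first observe that with probability 1, $X_i \in \mathcal{X}$ for $i = 1,\ldots,n$,
and therefore:

$$\forall f \in \mathcal{H}_\sigma, \quad \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i)) = \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i f_{|\mathcal{X}}(X_i)\right), \qquad (39)$$

from which we can conclude, using the same line of proof as (38), that the following holds a.s.:

$$\hat{R}_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}}) = \hat{R}_{\phi,\sigma}(\hat{f}_{\phi,\sigma}).$$

Let us now show that this implies the following equality:

$$\hat{f}_{\phi,\sigma}^{\mathcal{X}} = \hat{f}_{\phi,\sigma|\mathcal{X}}. \qquad (40)$$

Indeed, on the one hand, $\| \hat{f}_{\phi,\sigma|\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \| \hat{f}_{\phi,\sigma} \|_{\mathcal{H}_\sigma}$ by (27). On the other hand, (39) implies that

$$\frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i \hat{f}_{\phi,\sigma}(X_i)\right) = \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i \hat{f}_{\phi,\sigma|\mathcal{X}}(X_i)\right).$$

As a result, we get $\hat{R}_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma|\mathcal{X}}) \le \hat{R}_{\phi,\sigma}(\hat{f}_{\phi,\sigma}) = \hat{R}_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}})$, from which we deduce that $\hat{f}_{\phi,\sigma|\mathcal{X}}$
and $\hat{f}_{\phi,\sigma}^{\mathcal{X}}$ both minimize the strictly convex functional $\hat{R}_{\phi,\sigma}^{\mathcal{X}}$ on $\mathcal{H}_\sigma^{\mathcal{X}}$, proving (40). We also deduce
from $\hat{R}_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma|\mathcal{X}}) = \hat{R}_{\phi,\sigma}(\hat{f}_{\phi,\sigma})$ and from (39) that

$$\| \hat{f}_{\phi,\sigma|\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} = \| \hat{f}_{\phi,\sigma} \|_{\mathcal{H}_\sigma}. \qquad (41)$$

Now, using (40), (37), then (41), we can conclude the proof of the second statement as follows:

$$\begin{aligned}
R_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma}^{\mathcal{X}}) &= R_{\phi,\sigma}^{\mathcal{X}}(\hat{f}_{\phi,\sigma|\mathcal{X}}) \\
&= E_{P_{|\mathcal{X}}}\left[\phi\left(Y \hat{f}_{\phi,\sigma|\mathcal{X}}(X)\right)\right] + \lambda \| \hat{f}_{\phi,\sigma|\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \\
&= E_P\left[\phi\left(Y \hat{f}_{\phi,\sigma}(X)\right)\right] + \lambda \| \hat{f}_{\phi,\sigma} \|^2_{\mathcal{H}_\sigma} \\
&= R_{\phi,\sigma}(\hat{f}_{\phi,\sigma}),
\end{aligned}$$

concluding the proof of Lemma 16. ∎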


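Proof of Lemma 18 We prove the uniform upper bound on the variances of the excess-loss functions
in terms of their expectation, using an approach similar to but slightly simpler than Bartlett et al.
(2006, Lemma 14) and Steinwart and Scovel (2004, Proposition 6.1). First we observe, using (17)
and the fact that $\mathcal{F}_\sigma \subset \sqrt{\phi(0)/\lambda}\, \mathcal{B}_\sigma^{\mathcal{X}}$, that for any $f \in \mathcal{F}_\sigma$,

$$\| f \|_{L_\infty} \le \sqrt{\kappa_\sigma}\, \| f \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}}.$$

As a result, for any $(x,y) \in \mathcal{X} \times \{-1,+1\}$,

$$\begin{aligned}
| g_f(x,y) | &\le \left| \phi(y f(x)) - \phi\left(y f_{\phi,\sigma}^{\mathcal{X}}(x)\right) \right| + \lambda \left| \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} - \| f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \right| \\
&\le L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \left| f(x) - f_{\phi,\sigma}^{\mathcal{X}}(x) \right| + \lambda \| f - f_{\phi,\sigma}^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \| f + f_{\phi,\sigma}^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \\
&\le \left( L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\kappa_\sigma} + 2\sqrt{\lambda \phi(0)} \right) \| f - f_{\phi,\sigma}^{\mathcal{X}} \|_{\mathcal{H}_\sigma^{\mathcal{X}}}.
\end{aligned} \qquad (42)$$

Taking the square on both sides of this inequality and averaging with respect to P leads to:

$$\forall f \in \mathcal{F}_\sigma, \quad E_P\left[g_f^2\right] \le \left( L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\kappa_\sigma} + 2\sqrt{\lambda \phi(0)} \right)^2 \| f - f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}}. \qquad (43)$$

On the other hand, we deduce from the convexity of $\phi$ that for any $(x,y) \in \mathcal{X} \times \{-1,+1\}$ and
any $f \in \mathcal{F}_\sigma$:

$$\frac{\phi(y f(x)) + \lambda \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} + \phi\left(y f_{\phi,\sigma}^{\mathcal{X}}(x)\right) + \lambda \| f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}}}{2} \ge \phi\left( y\, \frac{f(x) + f_{\phi,\sigma}^{\mathcal{X}}(x)}{2} \right) + \lambda \left\| \frac{f + f_{\phi,\sigma}^{\mathcal{X}}}{2} \right\|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} + \lambda \left\| \frac{f - f_{\phi,\sigma}^{\mathcal{X}}}{2} \right\|^2_{\mathcal{H}_\sigma^{\mathcal{X}}}.$$

Averaging this inequality with respect to P rewrites:

$$\frac{R_{\phi,\sigma}^{\mathcal{X}}(f) + R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}})}{2} \ge R_{\phi,\sigma}^{\mathcal{X}}\left( \frac{f + f_{\phi,\sigma}^{\mathcal{X}}}{2} \right) + \lambda \left\| \frac{f - f_{\phi,\sigma}^{\mathcal{X}}}{2} \right\|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \ge R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}) + \lambda \left\| \frac{f - f_{\phi,\sigma}^{\mathcal{X}}}{2} \right\|^2_{\mathcal{H}_\sigma^{\mathcal{X}}},$$

where the second inequality is due to the definition of $f_{\phi,\sigma}^{\mathcal{X}}$ as a minimizer of $R_{\phi,\sigma}^{\mathcal{X}}$. Therefore we get,
for any $f \in \mathcal{F}_\sigma$,

$$E_P\left[g_f\right] = R_{\phi,\sigma}^{\mathcal{X}}(f) - R_{\phi,\sigma}^{\mathcal{X}}(f_{\phi,\sigma}^{\mathcal{X}}) \ge \frac{\lambda}{2} \| f - f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}}. \qquad (44)$$

Combining (43) and (44) finishes the proof of Lemma 18. ∎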


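Proof of Lemma 19 Following a path similar to (42), we can write for any $f \in \mathcal{F}_\sigma$ and any $(x,y) \in \mathcal{X} \times \{-1,+1\}$:

$$\begin{aligned}
| g_f(x,y) | &\le \left| \phi(y f(x)) - \phi\left(y f_{\phi,\sigma}^{\mathcal{X}}(x)\right) \right| + \lambda \left| \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} - \| f_{\phi,\sigma}^{\mathcal{X}} \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \right| \\
&\le L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \left| f(x) - f_{\phi,\sigma}^{\mathcal{X}}(x) \right| + \phi(0) \\
&\le 2 L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} + \phi(0) = B. \qquad ∎
\end{aligned}$$
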

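Proof of Lemma 20 Let us introduce the notation $l_\phi \circ f(x,y) := \phi(y f(x))$ and $L_{\phi,\sigma} \circ f := l_\phi \circ f + \lambda \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}}$
for $f \in \mathcal{H}_\sigma^{\mathcal{X}}$ and $(x,y) \in \mathcal{X} \times \{-1,1\}$. We can then rewrite (29) as:

$$\mathcal{G}_\sigma = \left\{ L_{\phi,\sigma} \circ f - L_{\phi,\sigma} \circ f_{\phi,\sigma}^{\mathcal{X}} : f \in \mathcal{F}_\sigma \right\}.$$

The covering number of a set does not change when the set is translated by a single function,
therefore:

$$\mathcal{N}\left(B^{-1}\mathcal{G}_\sigma, \varepsilon, L_2(T)\right) = \mathcal{N}\left(B^{-1} L_{\phi,\sigma} \circ \mathcal{F}_\sigma, \varepsilon, L_2(T)\right).$$

Denoting now [a,b] the set of constant functions with values between a and b, we deduce, from the
fact that $\lambda \| f \|^2_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \phi(0)$ for $f \in \mathcal{F}_\sigma$, that

$$B^{-1} L_{\phi,\sigma} \circ \mathcal{F}_\sigma \subset B^{-1} l_\phi \circ \mathcal{F}_\sigma + \left[0, B^{-1}\phi(0)\right].$$

Using the sub-additivity of the entropy we therefore get:

$$\log \mathcal{N}\left(B^{-1}\mathcal{G}_\sigma, 2\varepsilon, L_2(T)\right) \le \log \mathcal{N}\left(B^{-1} l_\phi \circ \mathcal{F}_\sigma, \varepsilon, L_2(T)\right) + \log \mathcal{N}\left(\left[0, B^{-1}\phi(0)\right], \varepsilon, L_2(T)\right). \qquad (45)$$

In order to upper bound the first term in the r.h.s. of (45), we observe, using (17), that for any $f \in \mathcal{F}_\sigma$
and $x \in \mathcal{X}$,

$$| f(x) | \le \sqrt{\kappa_\sigma}\, \| f \|_{\mathcal{H}_\sigma^{\mathcal{X}}} \le \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}},$$

and therefore a simple computation shows that, if $u(x,y) := B^{-1}\phi(y f(x))$ and $u'(x,y) := B^{-1}\phi(y f'(x))$
are two elements of $B^{-1} l_\phi \circ \mathcal{F}_\sigma$ (with $f, f' \in \mathcal{F}_\sigma$), then for any sample T:

$$\| u - u' \|_{L_2(T)} \le B^{-1} L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right) \| f - f' \|_{L_2(T)},$$

and therefore

$$\begin{aligned}
\log \mathcal{N}\left(B^{-1} l_\phi \circ \mathcal{F}_\sigma, \varepsilon, L_2(T)\right) &\le \log \mathcal{N}\left( \mathcal{F}_\sigma,\ B\varepsilon L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{-1},\ L_2(T) \right) \\
&\le \log \mathcal{N}\left( \mathcal{B}_\sigma^{\mathcal{X}},\ B\varepsilon L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{-1} \sqrt{\frac{\lambda}{\phi(0)}},\ L_2(T) \right).
\end{aligned} \qquad (46)$$

Recalling the definition of B in (34), we obtain:

$$B\varepsilon L\left( \sqrt{\frac{\kappa_\sigma \phi(0)}{\lambda}} \right)^{-1} \sqrt{\frac{\lambda}{\phi(0)}} \ge 2\varepsilon \sqrt{\kappa_\sigma},$$

and therefore

$$\log \mathcal{N}\left(B^{-1} l_\phi \circ \mathcal{F}_\sigma, \varepsilon, L_2(T)\right) \le \log \mathcal{N}\left(\mathcal{B}_\sigma^{\mathcal{X}}, 2\varepsilon\sqrt{\kappa_\sigma}, L_2(T)\right).$$

The second term in the r.h.s. of (45) is easily upper bounded by:

$$\log \mathcal{N}\left(\left[0, B^{-1}\phi(0)\right], \varepsilon, L_2(T)\right) \le \log\left( \frac{\phi(0)}{B\varepsilon} \right),$$

and we finally get:

$$\log \mathcal{N}\left(B^{-1}\mathcal{G}_\sigma, 2\varepsilon, L_2(T)\right) \le \log \mathcal{N}\left(\mathcal{B}_\sigma^{\mathcal{X}}, 2\varepsilon\sqrt{\kappa_\sigma}, L_2(T)\right) + \log\left( \frac{\phi(0)}{B\varepsilon} \right). \qquad (47)$$

We now need to upper bound the covering number of the unit ball in the RKHS. We make use of
the following result, proved by Steinwart and Scovel (2004, Theorem 2.1, page 5): if $\tilde{\mathcal{B}}_\sigma^{\mathcal{X}}$ denotes
the unit ball of the RKHS associated with the non-normalized Gaussian kernel (20) on a compact
set, then for all $0 < p \le 2$ and all $\delta > 0$ there exists a constant $c_{p,\delta,d}$ independent of $\sigma$ such that for
all $\varepsilon > 0$ we have:

$$\log \mathcal{N}\left(\tilde{\mathcal{B}}_\sigma^{\mathcal{X}}, \varepsilon, L_2(T)\right) \le c_{p,\delta,d}\, \sigma^{-((1-p/2)(1+\delta))d}\, \varepsilon^{-p}. \qquad (48)$$

Now, using (21), we observe that

$$\mathcal{B}_\sigma^{\mathcal{X}} = \sqrt{\kappa_\sigma}\, \tilde{\mathcal{B}}_\sigma^{\mathcal{X}},$$

and therefore:

$$\begin{aligned}
\log \mathcal{N}\left(\mathcal{B}_\sigma^{\mathcal{X}}, 2\varepsilon\sqrt{\kappa_\sigma}, L_2(T)\right) &= \log \mathcal{N}\left(\sqrt{\kappa_\sigma}\, \tilde{\mathcal{B}}_\sigma^{\mathcal{X}}, 2\varepsilon\sqrt{\kappa_\sigma}, L_2(T)\right) \\
&= \log \mathcal{N}\left(\tilde{\mathcal{B}}_\sigma^{\mathcal{X}}, 2\varepsilon, L_2(T)\right).
\end{aligned} \qquad (49)$$

Plugging (48) into (49), and (49) into (47) finally leads to the announced result, after observing that
the second term in the r.h.s. of (47) becomes negligible compared to the first one and can therefore
be hidden in the constant for $\varepsilon$ small enough. ∎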


6. Some Properties of the $L_2$-Norm-Regularized $\phi$-Risk


In this section we investigate the conditions on the loss function $\phi$ under which the Bayes consistency
of the minimization of the regularized $\phi$-risk holds. In the spirit of Bartlett et al. (2006), we
introduce a notion of classification-calibration for regularized loss functions $\phi$, and upper bound the
excess risk of any classifier f in terms of its excess of regularized $\phi$-risk. We also upper-bound
the $L_2$-distance between f and $f_{\phi,0}$ in terms of the excess of regularized $\phi$-risk of f, which is useful
to prove Bayes consistency in the one-class setting.




6.1 Classification Calibration 


In the classical setting, Bartlett et al. (2006, Definition 1, page 7) introduce the following notion of
classification-calibrated loss functions:


Definition 21 For any $(\eta,\alpha) \in [0,1] \times \mathbb{R}$, let the generic conditional $\phi$-risk be defined by:

$$C_\eta(\alpha) := \eta\, \phi(\alpha) + (1-\eta)\, \phi(-\alpha).$$

The loss function $\phi$ is said to be classification-calibrated if, for any $\eta \in [0,1] \setminus \{1/2\}$:

$$\inf_{\alpha \in \mathbb{R} :\, \alpha(2\eta-1) \le 0} C_\eta(\alpha) > \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha).$$

The importance here is in the strict inequality, which implies in particular that if the global infimum
of $C_\eta$ is reached at some point $\alpha$, then $\alpha > 0$ (resp. $\alpha < 0$) if $\eta > 1/2$ (resp. $\eta < 1/2$). This
condition, that generalizes the requirement that the minimizer of $C_\eta(\alpha)$ has the correct sign, is a
minimal condition that can be viewed as a pointwise form of Fisher consistency for classification.
In our case, noting that for any $f \in \mathcal{M}$, the $L_2$-regularized $\phi$-risk can be rewritten as follows:

$$R_{\phi,0}(f) = \int_{\mathbb{R}^d} \left\{ \left[ \eta(x)\phi(f(x)) + (1-\eta(x))\phi(-f(x)) \right] \rho(x) + \lambda f(x)^2 \right\} dx,$$

we introduce the regularized generic conditional $\phi$-risk:

$$\forall (\eta,\rho,\alpha) \in [0,1] \times (0,+\infty) \times \mathbb{R}, \quad C_{\eta,\rho}(\alpha) := C_\eta(\alpha) + \frac{\lambda \alpha^2}{\rho},$$

as well as the related weighted regularized generic conditional $\phi$-risk:

$$\forall (\eta,\rho,\alpha) \in [0,1] \times [0,+\infty) \times \mathbb{R}, \quad G_{\eta,\rho}(\alpha) := \rho\, C_\eta(\alpha) + \lambda \alpha^2.$$

This leads to the following notion of classification-calibration:

Definition 22 We say that $\phi$ is classification calibrated for the regularized risk, or R-classification-calibrated,
if for any $(\eta,\rho) \in [0,1] \setminus \{1/2\} \times (0,+\infty)$

$$\inf_{\alpha \in \mathbb{R} :\, \alpha(2\eta-1) \le 0} C_{\eta,\rho}(\alpha) > \inf_{\alpha \in \mathbb{R}} C_{\eta,\rho}(\alpha).$$


The following result clarifies the relationship between the properties of classification-calibration 
and R-classification-calibration. 


Lemma 23 For any function $\phi : \mathbb{R} \to [0,+\infty)$, $\phi(x)$ is R-classification-calibrated if and only if for
any $t > 0$, $\phi(x) + t x^2$ is classification-calibrated.


Proof For any $\phi : \mathbb{R} \to [0,+\infty)$ and $\rho > 0$, let $\phi_\rho(x) := \phi(x) + \lambda x^2/\rho$ and $C^{\rho}_\eta$ the corresponding
generic conditional $\phi_\rho$-risk. Then one easily gets, for any $\alpha \in \mathbb{R}$,

$$C^{\rho}_\eta(\alpha) = C_{\eta,\rho}(\alpha).$$

As a result, $\phi$ is R-classification-calibrated if and only if, for any $\rho > 0$, $\phi_\rho$ is classification-calibrated,
which proves the lemma. ∎


Classification-calibration and R-classification-calibration are two different properties related to
each other by Lemma 23, but neither of them implies the other one for general non-convex functions
$\phi$, as illustrated by the following two examples.

Example 1 : A classification-calibrated, but not R-classification-calibrated function. Let $\phi(x) = 1$
on $(-\infty,-2]$, $\phi(x) = 2$ on $[-1,1]$, $\phi(x) = 0$ on $[2,+\infty)$, and $\phi$ continuous linear on $[-2,-1]$
and $[1,2]$. Then $C_\eta(\alpha)$ is also continuous and piecewise linear on the intervals delimited by the
points $-2,-1,1,2$, with values $\eta$ on $(-\infty,-2]$, $1-\eta$ on $[2,+\infty)$, and 2 on $[-1,1]$. As a result,
$\inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) = \min(\eta, 1-\eta)$ and $\inf_{\alpha \in \mathbb{R} : \alpha(2\eta-1) \le 0} C_\eta(\alpha) = \max(\eta, 1-\eta)$. This shows that $\phi$
is classification-calibrated. However, as soon as $\rho < 2\lambda$, the global minimum of $C_{\eta,\rho}(\alpha)$ is reached
for $\alpha = 0$ and therefore:

$$\inf_{\alpha \in \mathbb{R} :\, \alpha(2\eta-1) \le 0} C_{\eta,\rho}(\alpha) = \inf_{\alpha \in \mathbb{R}} C_{\eta,\rho}(\alpha) = 2,$$

which shows that $\phi$ is not R-classification-calibrated in this case.
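Example 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original argument: it implements the piecewise-linear loss of Example 1, evaluates the conditional risks on a finite grid over [-5, 5] (so infima are approximated by grid minima), and uses the hypothetical values λ = 1 and ρ = 1, which satisfy ρ < 2λ.

```python
def phi_ex1(x):
    """Loss of Example 1: 1 on (-inf,-2], 2 on [-1,1], 0 on [2,inf), linear in between."""
    if x <= -2.0:
        return 1.0
    if x <= -1.0:
        return 1.0 + (x + 2.0)        # rises from 1 at x=-2 to 2 at x=-1
    if x <= 1.0:
        return 2.0
    if x <= 2.0:
        return 2.0 - 2.0 * (x - 1.0)  # falls from 2 at x=1 to 0 at x=2
    return 0.0

def cond_risk(eta, a, lam=0.0, rho=1.0):
    """C_eta(a) when lam == 0, and C_{eta,rho}(a) = C_eta(a) + lam * a^2 / rho otherwise."""
    return eta * phi_ex1(a) + (1.0 - eta) * phi_ex1(-a) + lam * a * a / rho

# Grid on [-5, 5] with step 0.001; it contains 0, +-1 and +-2 exactly.
GRID = [i / 1000.0 for i in range(-5000, 5001)]

def inf_all(eta, lam=0.0, rho=1.0):
    """Grid minimum over all a."""
    return min(cond_risk(eta, a, lam, rho) for a in GRID)

def inf_wrong_sign(eta, lam=0.0, rho=1.0):
    """Grid minimum over the a with a * (2 * eta - 1) <= 0."""
    return min(cond_risk(eta, a, lam, rho) for a in GRID if a * (2.0 * eta - 1.0) <= 0.0)
```

With η = 0.8, the unregularized minima differ (strict inequality, so calibration holds at this η), whereas with λ = 1 and ρ = 1 both minima are attained at α = 0 and coincide.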


Example 2 : An R-classification-calibrated, but not classification-calibrated function. Let $\phi : \mathbb{R} \to (0,+\infty)$
be any function with negative right-hand and positive left-hand derivatives at 0, satisfying

$$\lim_{\alpha \to -\infty} \phi(\alpha) = \lim_{\alpha \to +\infty} \phi(\alpha) = 0 \qquad (50)$$

and

$$\forall \alpha > 0, \quad \phi(\alpha) < \phi(-\alpha). \qquad (51)$$

An example of a function that satisfies these conditions is $\phi(\alpha) = e^{\alpha}$ for $\alpha \le 0$, $\phi(\alpha) = e^{-2\alpha}$ for $\alpha > 0$.
Because of (50), it is clear that such functions satisfy

$$\inf_{\alpha \le 0} C_\eta(\alpha) = \inf_{\alpha \ge 0} C_\eta(\alpha) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) = 0,$$

which shows that they are not classification-calibrated. In order to show that they are R-classification-calibrated,
it suffices to show by Lemma 23 that for any $t > 0$, $\phi(x) + t x^2$ is classification-calibrated. $\phi$
being nonnegative, the corresponding generic conditional risk

$$\tilde{C}_\eta(\alpha) = \eta\, \phi(\alpha) + (1-\eta)\, \phi(-\alpha) + t \alpha^2$$

satisfies:

$$\lim_{\alpha \to -\infty} \tilde{C}_\eta(\alpha) = \lim_{\alpha \to +\infty} \tilde{C}_\eta(\alpha) = +\infty.$$

As a result, for any $\eta \ne 1/2$, the infimum of $\tilde{C}_\eta$ over $\{\alpha \in \mathbb{R} : \alpha(2\eta-1) \le 0\}$ is reached at some finite
point $\alpha_\eta$. Moreover, $\tilde{C}_\eta$ has negative right-hand and positive left-hand derivatives at 0, ensuring
that the minimum is not reached at 0: $\alpha_\eta(2\eta-1) < 0$. This implies by (51) that

$$(2\eta-1)\left( \phi(\alpha_\eta) - \phi(-\alpha_\eta) \right) > 0.$$

Combining this with the following equality holding for any $\alpha \in \mathbb{R}$:

$$\tilde{C}_\eta(\alpha) - \tilde{C}_\eta(-\alpha) = (2\eta-1)\left( \phi(\alpha) - \phi(-\alpha) \right),$$

we obtain $\tilde{C}_\eta(\alpha_\eta) > \tilde{C}_\eta(-\alpha_\eta)$. As a result,

$$\inf_{\alpha \in \mathbb{R} :\, \alpha(2\eta-1) \le 0} \tilde{C}_\eta(\alpha) = \tilde{C}_\eta(\alpha_\eta) > \tilde{C}_\eta(-\alpha_\eta) \ge \inf_{\alpha \in \mathbb{R}} \tilde{C}_\eta(\alpha),$$

showing that $\phi(x) + t x^2$ is classification-calibrated, and therefore that $\phi$ is R-classification-calibrated.

6.2 Classification-Calibration of Convex Loss Functions


The following lemma states the equivalence between classification calibration and R-classification
calibration for convex loss functions, and it gives a simple characterization of this property.


Lemma 24 For a convex function $\phi : \mathbb{R} \to [0,+\infty)$, the following properties are equivalent:
1. $\phi$ is classification-calibrated,
2. $\phi$ is R-classification-calibrated,
3. $\phi$ is differentiable at 0 and $\phi'(0) < 0$.


Proof The equivalence of the first and the third properties is shown in Bartlett et al. (2006, Theorem 4).
From this and Lemma 23, we deduce that $\phi$ is R-classification-calibrated iff $\phi(x) + t x^2$ is
classification-calibrated for any $t > 0$, iff $\phi(x) + t x^2$ is differentiable at 0 with negative derivative
(for any $t > 0$), iff $\phi(x)$ is differentiable at 0 with negative derivative. This proves the equivalence
between the second and third properties. ∎


6.3 Some Properties of the Minimizer of the $R_{\phi,0}$-Risk


When $\phi$ is convex, the function $C_\eta(\alpha)$ is a convex function (as a convex combination of convex
functions), and therefore $G_{\eta,\rho}(\alpha)$ is strictly convex and diverges to $+\infty$ in $-\infty$ and $+\infty$; as a result,
for any $(\eta,\rho) \in [0,1] \times [0,+\infty)$, there exists a unique $\alpha(\eta,\rho)$ that minimizes $G_{\eta,\rho}$ on $\mathbb{R}$. It satisfies
the following inequality:


Lemma 25 If $\phi : \mathbb{R} \to [0,+\infty)$ is a convex function, then for any $(\eta,\rho) \in [0,1] \times [0,+\infty)$ and any $\alpha \in \mathbb{R}$,

$$G_{\eta,\rho}(\alpha) - G_{\eta,\rho}(\alpha(\eta,\rho)) \ge \lambda \left( \alpha - \alpha(\eta,\rho) \right)^2. \qquad (52)$$


Proof For any $(\eta,\rho) \in [0,1] \times [0,+\infty)$, the function $G_{\eta,\rho}(\alpha)$ is the sum of the convex function
$\rho\, C_\eta(\alpha)$ and of the strictly convex function $\lambda \alpha^2$. Let us denote by $C_\eta^{+}(\alpha)$ the right-hand derivative
of $C_\eta$ at the point $\alpha$ (which is well defined for convex functions). The right-hand derivative of
a convex function being non-negative at a minimum, we have (denoting $\alpha_* := \alpha(\eta,\rho)$):

$$\rho\, C_\eta^{+}(\alpha_*) + 2\lambda \alpha_* \ge 0. \qquad (53)$$

Now, for any $\alpha > \alpha_*$, we have by convexity of $C_\eta$:

$$C_\eta(\alpha) \ge C_\eta(\alpha_*) + (\alpha - \alpha_*)\, C_\eta^{+}(\alpha_*). \qquad (54)$$

Moreover, by direct calculation we get:

$$\lambda \alpha^2 = \lambda \alpha_*^2 + 2\lambda \alpha_* (\alpha - \alpha_*) + \lambda (\alpha - \alpha_*)^2. \qquad (55)$$

Multiplying (54) by $\rho$, adding (55) and plugging (53) into the result leads to:

$$G_{\eta,\rho}(\alpha) - G_{\eta,\rho}(\alpha_*) \ge \lambda (\alpha - \alpha_*)^2.$$

This inequality is also valid for $\alpha \le \alpha_*$: starting this time from

$$\rho\, C_\eta^{-}(\alpha_*) + 2\lambda \alpha_* \le 0,$$

where $C_\eta^{-}$ denotes the left-hand derivative of $C_\eta$, and from

$$C_\eta(\alpha) \ge C_\eta(\alpha_*) + (\alpha - \alpha_*)\, C_\eta^{-}(\alpha_*),$$

which holds for any $\alpha \le \alpha_*$ by convexity of $C_\eta$, we can follow exactly the same line of reasoning as
for the case $\alpha > \alpha_*$. ∎
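Inequality (52) can be checked numerically for a specific loss. The sketch below is illustrative only: it uses the hinge loss φ(α) = max(1 - α, 0), for which the minimizer of G_{η,ρ} is (η - 1/2)ρ/λ clipped to [-1, 1] (obtained by setting the derivative of G to zero on (-1, 1)), and reports the smallest slack of (52) over a finite grid. The function names and the grid are assumptions made for this example.

```python
def hinge(a):
    """Hinge loss max(1 - a, 0)."""
    return max(1.0 - a, 0.0)

def G(eta, rho, lam, a):
    """Weighted regularized conditional risk G_{eta,rho}(a) = rho * C_eta(a) + lam * a^2."""
    return rho * (eta * hinge(a) + (1.0 - eta) * hinge(-a)) + lam * a * a

def alpha_star(eta, rho, lam):
    """Minimizer of G for the hinge loss: (eta - 1/2) * rho / lam clipped to [-1, 1]."""
    return max(-1.0, min(1.0, (eta - 0.5) * rho / lam))

def worst_slack(eta, rho, lam, half_width=3.0, n=6001):
    """Smallest value of G(a) - G(a*) - lam * (a - a*)^2 on a grid; (52) says it is >= 0."""
    a_star = alpha_star(eta, rho, lam)
    g_star = G(eta, rho, lam, a_star)
    step = 2.0 * half_width / (n - 1)
    return min(
        G(eta, rho, lam, a) - g_star - lam * (a - a_star) ** 2
        for a in (-half_width + i * step for i in range(n))
    )
```

Up to floating-point rounding, `worst_slack` is nonnegative for every (η, ρ, λ) tried, and it vanishes when the minimizer lies in the interior of [-1, 1], where G is exactly quadratic around it.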


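From this result we obtain the following characterization and properties of the minimizer of
the $R_{\phi,0}$-risk:


Theorem 26 If $\phi : \mathbb{R} \to [0,+\infty)$ is a convex function, then the function $f_{\phi,0} : \mathbb{R}^d \to \mathbb{R}$ defined for
any $x \in \mathbb{R}^d$ by

$$f_{\phi,0}(x) := \alpha(\eta(x), \rho(x))$$

satisfies:
1. $f_{\phi,0}$ is measurable.
2. $f_{\phi,0}$ minimizes the $R_{\phi,0}$-risk:

$$R_{\phi,0}(f_{\phi,0}) = \inf_{f \in \mathcal{M}} R_{\phi,0}(f).$$

3. For any $f \in \mathcal{M}$, the following holds:

$$\| f - f_{\phi,0} \|^2_{L_2} \le \frac{1}{\lambda} \left( R_{\phi,0}(f) - R^*_{\phi,0} \right).$$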

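Proof To show that $f_{\phi,0}$ is measurable, it suffices to show that the mapping $(\eta,\rho) \in [0,1] \times [0,+\infty) \mapsto \alpha(\eta,\rho)$
is continuous. Indeed, if this is true, then $f_{\phi,0}$ is measurable as a continuous
function of two measurable functions $\eta$ and $\rho$.

In order to show the continuity of $(\eta,\rho) \mapsto \alpha(\eta,\rho)$, fix $(\eta_0,\rho_0) \in [0,1] \times [0,+\infty)$ and the corresponding
$\alpha_0 := \alpha(\eta_0,\rho_0)$. Then, for any $\varepsilon > 0$, there exists a neighborhood $B_\varepsilon$ of $(\eta_0,\rho_0)$ such that
for any $(\eta,\rho) \in B_\varepsilon$, for any $\alpha \in [\alpha_0 - \varepsilon, \alpha_0 + \varepsilon]$,

$$\left| G_{\eta,\rho}(\alpha) - G_{\eta_0,\rho_0}(\alpha) \right| < \frac{\lambda \varepsilon^2}{3}. \qquad (56)$$

To see that, first note that the function $\phi$ is continuous and thus bounded by some constant A
on $[-|\alpha_0| - \varepsilon, |\alpha_0| + \varepsilon]$, an interval that contains both $\alpha$ and $-\alpha$ for any $\alpha$ in $[\alpha_0 - \varepsilon, \alpha_0 + \varepsilon]$.
Therefore, for any such $\alpha$, we have

$$\begin{aligned}
\left| G_{\eta,\rho}(\alpha) - G_{\eta_0,\rho_0}(\alpha) \right| &= \left| (\eta\rho - \eta_0\rho_0)\left( \phi(\alpha) - \phi(-\alpha) \right) + (\rho - \rho_0)\, \phi(-\alpha) \right| \\
&\le 2A \left( |\eta\rho - \eta_0\rho_0| + |\rho - \rho_0| \right).
\end{aligned}$$

Hence, (56) holds by taking, for instance,

$$B_\varepsilon := \left\{ (\eta,\rho) \in [0,1] \times [0,+\infty) : |\eta\rho - \eta_0\rho_0| < \lambda\varepsilon^2/12A,\ |\rho - \rho_0| < \lambda\varepsilon^2/12A \right\}.$$

Now, applying (56) successively to $\alpha_0 + \varepsilon$ and to $\alpha_0$, then using (52), we easily obtain that for
any $(\eta,\rho) \in B_\varepsilon$,

$$G_{\eta,\rho}(\alpha_0 + \varepsilon) \ge G_{\eta,\rho}(\alpha_0) + \frac{\lambda \varepsilon^2}{3}.$$

In the same way, applying (56) successively to $\alpha_0 - \varepsilon$ and to $\alpha_0$, then using (52), we obtain that for
any $(\eta,\rho) \in B_\varepsilon$,

$$G_{\eta,\rho}(\alpha_0 - \varepsilon) \ge G_{\eta,\rho}(\alpha_0) + \frac{\lambda \varepsilon^2}{3}.$$

This reveals the existence of two points around $\alpha_0$, namely $\alpha_0 - \varepsilon$ and $\alpha_0 + \varepsilon$, at which the function
$G_{\eta,\rho}$ takes values larger than $G_{\eta,\rho}(\alpha_0)$. By convexity of $G_{\eta,\rho}$, this implies that its minimizer,
namely $\alpha(\eta,\rho)$, is in the interval $[\alpha_0 - \varepsilon, \alpha_0 + \varepsilon]$, as soon as $(\eta,\rho) \in B_\varepsilon$, which concludes the proof
of the continuity of $(\eta,\rho) \mapsto \alpha(\eta,\rho)$, and therefore of the measurability of $f_{\phi,0}$.

Now, we have by construction, for any $f \in \mathcal{M}$:

$$\forall x \in \mathbb{R}^d, \quad G_{\eta(x),\rho(x)}(f_{\phi,0}(x)) \le G_{\eta(x),\rho(x)}(f(x)),$$

which after integration leads to:

$$R_{\phi,0}(f_{\phi,0}) \le R_{\phi,0}(f),$$

proving the second statement of the theorem.

Finally, for any $f \in \mathcal{M}$, rewriting (52) with $\alpha = f(x)$, $\rho = \rho(x)$ and $\eta = \eta(x)$ shows that:

$$\forall x \in \mathbb{R}^d, \quad G_{\eta(x),\rho(x)}(f(x)) - G_{\eta(x),\rho(x)}(f_{\phi,0}(x)) \ge \lambda \left( f(x) - f_{\phi,0}(x) \right)^2,$$

which, after integration over $\mathbb{R}^d$, proves the third statement of Theorem 26. ∎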


6.4 Relating the $R_{\phi,0}$-Risk with the Classification Error Rate


In the “classical” setting (with a regularization parameter converging to 0), the idea of relating the
convexified risk to the true risk (more simply called risk) has recently gained a lot of interest. Zhang
(2004) and Lugosi and Vayatis (2004) upper bound the excess-risk by some function of the excess $\phi$-risk
to prove consistency of various algorithms (and obtain upper bounds for the rates of convergence
of the risk to the Bayes risk). These ideas were then generalized by Bartlett et al. (2006), whose approach we
now adapt to our framework.

Let us define, for any $(\eta,\rho) \in [0,1] \times (0,+\infty)$,

$$M(\eta,\rho) := \min_{\alpha \in \mathbb{R}} C_{\eta,\rho}(\alpha) = C_{\eta,\rho}(\alpha(\eta,\rho)),$$

and for any $\rho > 0$ the function $\psi_\rho$ defined for all $\theta$ in $[0,1]$ by

$$\psi_\rho(\theta) := \phi(0) - M\left( \frac{1+\theta}{2}, \rho \right).$$


The following lemma summarizes a few properties of M and $\psi_\rho$. Explicit computations for some
standard loss functions are performed in Section 7.

Lemma 27 If $\phi : \mathbb{R} \to [0,+\infty)$ is a convex function, then for any $\rho > 0$, the following properties
hold:

1. the function $\eta \mapsto M(\eta,\rho)$ is symmetric around 1/2, concave, and continuous on $[0,1]$;
2. $\psi_\rho$ is convex, continuous, nonnegative, and nondecreasing on $[0,1]$, and $\psi_\rho(0) = 0$;
3. if $0 < \rho < \tau$, then $\psi_\rho \le \psi_\tau$ on $[0,1]$;
4. $\phi$ is R-classification-calibrated if and only if $\psi_\rho(\theta) > 0$ for $\theta \in (0,1]$.

Proof For any $\rho > 0$, let

$$\phi_\rho(x) := \phi(x) + \frac{\lambda x^2}{\rho}.$$

As already observed in the proof of Lemma 23, the corresponding generic conditional $\phi_\rho$-risk $C^{\rho}_\eta$
satisfies

$$C^{\rho}_\eta(\alpha) = C_{\eta,\rho}(\alpha).$$

$\phi_\rho$ being convex, the first two points are direct consequences of Bartlett et al. (2006, Theorem 4 &
Lemma 6). In particular, the function $\psi_\rho$ is nondecreasing due to the fact that it is minimal at 0 and
convex on $[0,1]$.

To prove the third point, it suffices to observe that for $0 < \rho < \tau$ we have for any $(\eta,\alpha) \in [0,1] \times \mathbb{R}$:

$$C_{\eta,\rho}(\alpha) - C_{\eta,\tau}(\alpha) = \lambda \alpha^2 \left( \frac{1}{\rho} - \frac{1}{\tau} \right) \ge 0,$$

which implies, by taking the minimum in $\alpha$:

$$M(\eta,\rho) \ge M(\eta,\tau),$$

and therefore, for $\theta \in [0,1]$,

$$\psi_\rho(\theta) \le \psi_\tau(\theta).$$

Finally, by Lemma 24, $\phi$ is R-classification-calibrated iff $\phi_\rho$ is classification-calibrated (because
both properties are equivalent to saying that $\phi$ is differentiable at 0 and $\phi'(0) < 0$), iff $\psi_\rho(\theta) > 0$
for $\theta \in (0,1]$ by Bartlett et al. (2006, Theorem 6). ∎


We are now in a position to state a first result relating the excess $R_{\phi,0}$-risk to the excess-risk. The
dependence on $\rho(x)$ generates difficulties compared with the “classical” setting, which forces us to
separate the low density regions from the rest in the analysis.

Theorem 28 Suppose $\phi$ is a convex classification-calibrated function, and for any $\varepsilon > 0$, let

$$A_\varepsilon := \left\{ x \in \mathbb{R}^d : \rho(x) \le \varepsilon \right\}.$$

For any $f \in \mathcal{M}$ the following holds:

$$R(f) - R^* \le \inf_{\varepsilon > 0} \left\{ P(A_\varepsilon) + \psi_\varepsilon^{-1}\left( R_{\phi,0}(f) - R^*_{\phi,0} \right) \right\}. \qquad (57)$$

Proof First note that for convex classification-calibrated functions, $\psi_\varepsilon$ is strictly increasing on $[0,1]$.
Indeed, by Lemma 27, it is convex and reaches its unique minimum at 0 in this case. Since $\psi_\varepsilon$ is
also continuous on $[0,1]$, it is therefore invertible, which justifies the use of $\psi_\varepsilon^{-1}$ in (57).

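Fix now a function $f \in \mathcal{M}$, and let $U(x) := 1$ if $f(x)(2\eta(x) - 1) < 0$, 0 otherwise (U is the
indicator function of the set where f and the optimal classifier disagree). For any $\varepsilon > 0$, if we
define $B_\varepsilon := \mathbb{R}^d \setminus A_\varepsilon$, we can compute:

$$\begin{aligned}
R_{\phi,0}(f) - R^*_{\phi,0} &= \int_{\mathbb{R}^d} \left[ C_{\eta(x),\rho(x)}(f(x)) - M(\eta(x),\rho(x)) \right] \rho(x)\, dx \\
&\ge \int_{\mathbb{R}^d} \left[ C_{\eta(x),\rho(x)}(f(x)) - M(\eta(x),\rho(x)) \right] U(x)\rho(x)\, dx \\
&\ge \int_{\mathbb{R}^d} \left[ \phi(0) - M(\eta(x),\rho(x)) \right] U(x)\rho(x)\, dx \\
&= \int_{\mathbb{R}^d} \psi_{\rho(x)}\left( |2\eta(x) - 1| \right) U(x)\rho(x)\, dx \\
&\ge \int_{B_\varepsilon} \psi_{\rho(x)}\left( |2\eta(x) - 1| \right) U(x)\rho(x)\, dx \\
&\ge \int_{B_\varepsilon} \psi_\varepsilon\left( |2\eta(x) - 1| \right) U(x)\rho(x)\, dx \\
&= \int_{B_\varepsilon} \psi_\varepsilon\left( |2\eta(x) - 1|\, U(x) \right) \rho(x)\, dx \\
&= P(B_\varepsilon) \int_{B_\varepsilon} \psi_\varepsilon\left( |2\eta(x) - 1|\, U(x) \right) \frac{\rho(x)}{P(B_\varepsilon)}\, dx \\
&\ge P(B_\varepsilon)\, \psi_\varepsilon\left( \frac{1}{P(B_\varepsilon)} \int_{B_\varepsilon} |2\eta(x) - 1|\, U(x)\rho(x)\, dx \right) \\
&\ge \psi_\varepsilon\left( \int_{B_\varepsilon} |2\eta(x) - 1|\, U(x)\rho(x)\, dx \right) \\
&= \psi_\varepsilon\left( \int_{\mathbb{R}^d} |2\eta(x) - 1|\, U(x)\rho(x)\, dx - \int_{A_\varepsilon} |2\eta(x) - 1|\, U(x)\rho(x)\, dx \right) \\
&\ge \psi_\varepsilon\left( \int_{\mathbb{R}^d} |2\eta(x) - 1|\, U(x)\rho(x)\, dx - P(A_\varepsilon) \right) \\
&= \psi_\varepsilon\left( R(f) - R^* - P(A_\varepsilon) \right),
\end{aligned}$$

where the successive (in)equalities are respectively justified by: (i) the definition of $R_{\phi,0}$ and the second
point of Theorem 26; (ii) the fact that $U \le 1$; (iii) the fact that when f and $2\eta - 1$ have different
signs, then $C_{\eta,\rho}(f) \ge C_{\eta,\rho}(0) = \phi(0)$; (iv) the definition of $\psi_\rho$; (v) the obvious fact that $B_\varepsilon \subset \mathbb{R}^d$;
(vi) the observation that, by definition, $\rho$ is larger than $\varepsilon$ on $B_\varepsilon$, and the third point of Lemma 27; (vii)
the fact that $\psi_\varepsilon(0) = 0$ and $U(x) \in \{0,1\}$; (viii) a simple division and multiplication by $P(B_\varepsilon) > 0$;
(ix) Jensen’s inequality; (x) the convexity of $\psi_\varepsilon$ and the facts that $\psi_\varepsilon(0) = 0$ and $P(B_\varepsilon) \le 1$; (xi) the
fact that $B_\varepsilon = \mathbb{R}^d \setminus A_\varepsilon$; (xii) the upper bound $|2\eta(x) - 1|\, U(x) \le 1$ and the fact that $\psi_\varepsilon$ is increasing;
and (xiii) a classical inequality that can be found, e.g., in Devroye et al. (1996, Theorem 2.2, page
16). Composing each side by the strictly increasing function $\psi_\varepsilon^{-1}$ leads to the announced result. ∎

6.5 Proof of Theorem 4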


Theorem 4 is a direct corollary of Theorem 28 (see footnote 4). Indeed, keeping the notation of the previous
section, let us choose for any $\delta > 0$ an $\varepsilon$ small enough to ensure $P(A_\varepsilon) \le \delta/2$, and $N \in \mathbb{N}$ such that
for any $n > N$,

$$R_{\phi,0}(f_n) - R^*_{\phi,0} \le \psi_\varepsilon\left( \frac{\delta}{2} \right).$$

Then a direct application of Theorem 28 in this case shows that for any $n > N$, $R(f_n) - R^* \le \delta$,
concluding the proof of Theorem 4. ∎


This important result shows that any consistency result for the regularized $\phi$-risk implies consistency
for the true risk, that is, convergence to the Bayes risk. Besides, convergence rates for
the regularized $\phi$-risk towards its minimum translate into convergence rates for the risk towards the
Bayes risk thanks to (57), as we will show in the next subsection.


6.6 Refinements under a Low Noise Assumption

When the distribution P satisfies a low noise assumption as defined in Section 2, we have the following
result:


Theorem 29 Let $\phi$ be a convex loss function such that there exist $(\kappa,\beta,\nu) \in (0,+\infty)^3$ satisfying:

$$\forall (\varepsilon,u) \in (0,+\infty) \times \mathbb{R}, \quad \psi_\varepsilon^{-1}(u) \le \kappa\, u^{\beta} \varepsilon^{-\nu}.$$

Then for any distribution P with low density exponent $\gamma$, there exist constants $(K,r) \in (0,+\infty)^2$ such
that for any $f \in \mathcal{M}$ with an excess regularized $\phi$-risk upper bounded by r the following holds:

$$R(f) - R^* \le K \left( R_{\phi,0}(f) - R^*_{\phi,0} \right)^{\frac{\beta\gamma}{\gamma+\nu}}.$$
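Proof Let $(c_2,\varepsilon_0) \in (0,+\infty)^2$ be such that

$$\forall \varepsilon \in [0,\varepsilon_0], \quad P(A_\varepsilon) \le c_2\, \varepsilon^{\gamma}, \qquad (58)$$

and define

$$r := \varepsilon_0^{\frac{\gamma+\nu}{\beta}}\, \kappa^{-\frac{1}{\beta}}\, c_2^{\frac{1}{\beta}}. \qquad (59)$$

Given any function $f \in \mathcal{M}$ such that $\delta = R_{\phi,0}(f) - R^*_{\phi,0} \le r$, let

$$\varepsilon := \kappa^{\frac{1}{\gamma+\nu}}\, c_2^{-\frac{1}{\gamma+\nu}}\, \delta^{\frac{\beta}{\gamma+\nu}}. \qquad (60)$$

Because $\delta \le r$, we can upper bound $\varepsilon$ by:

$$\varepsilon \le \kappa^{\frac{1}{\gamma+\nu}}\, c_2^{-\frac{1}{\gamma+\nu}}\, r^{\frac{\beta}{\gamma+\nu}} = \varepsilon_0. \qquad (61)$$

4. We note that after this work was submitted, a related analysis has been proposed in Steinwart (2005b). The latter
provides a very general framework, which in particular allows one to derive Theorem 4 without the use of Theorem 28.

Combining (61) and (58), we obtain:

$$P(A_\varepsilon) \le c_2\, \varepsilon^{\gamma} = \kappa^{\frac{\gamma}{\gamma+\nu}}\, c_2^{\frac{\nu}{\gamma+\nu}}\, \delta^{\frac{\beta\gamma}{\gamma+\nu}}. \qquad (62)$$

On the other hand,

$$\psi_\varepsilon^{-1}(\delta) \le \kappa\, \delta^{\beta} \varepsilon^{-\nu} = \kappa^{\frac{\gamma}{\gamma+\nu}}\, c_2^{\frac{\nu}{\gamma+\nu}}\, \delta^{\frac{\beta\gamma}{\gamma+\nu}}. \qquad (63)$$

Combining Theorem 28 with (62) and (63) leads to the result claimed with the constant r defined
in (59) and

$$K := 2\, \kappa^{\frac{\gamma}{\gamma+\nu}}\, c_2^{\frac{\nu}{\gamma+\nu}}. \qquad ∎$$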
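The choice of ε in (60) balances the two terms of (57): the low-density mass c_2 ε^γ and the bound κ δ^β ε^{-ν}. The following sketch evaluates the constants of the proof so that this balance can be checked numerically; the function names and the numerical values used to exercise them are hypothetical.

```python
def threshold_r(kappa, beta, nu, gamma, c2, eps0):
    """r from (59): the largest excess regularized risk for which epsilon <= eps0."""
    return eps0 ** ((gamma + nu) / beta) * kappa ** (-1.0 / beta) * c2 ** (1.0 / beta)

def choose_eps(delta, kappa, beta, nu, gamma, c2):
    """epsilon from (60), balancing c2 * eps^gamma against kappa * delta^beta * eps^(-nu)."""
    s = gamma + nu
    return kappa ** (1.0 / s) * c2 ** (-1.0 / s) * delta ** (beta / s)

def constant_K(kappa, nu, gamma, c2):
    """K = 2 * kappa^(gamma / (gamma + nu)) * c2^(nu / (gamma + nu))."""
    s = gamma + nu
    return 2.0 * kappa ** (gamma / s) * c2 ** (nu / s)

def excess_risk_bound(delta, kappa, beta, nu, gamma, c2):
    """K * delta^(beta * gamma / (gamma + nu)), the bound of Theorem 29 for delta <= r."""
    return constant_K(kappa, nu, gamma, c2) * delta ** (beta * gamma / (gamma + nu))
```

At δ = r the chosen ε equals ε_0, and for any smaller δ the two balanced terms are equal and sum to the bound returned by `excess_risk_bound`.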


7. Consistency of SVMs


In this section we apply the results obtained throughout Section 6 for a general loss function $\phi$,
in particular the control of the excess R-risk by the excess $R_{\phi,0}$-risk of Theorem 29, to the specific
cases of the loss functions used in 1- and 2-SVM. This leads to the proof of Theorem 6 in Section
7.3.


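7.1 The Case of 1-SVM

Let $\phi(\alpha) = \max(1-\alpha, 0)$. Then we easily obtain, for any $(\eta,\rho) \in [0,1] \times (0,+\infty)$:

$$C_{\eta,\rho}(\alpha) = \begin{cases}
\eta(1-\alpha) + \lambda\alpha^2/\rho & \text{if } \alpha \in (-\infty,-1], \\
\eta(1-\alpha) + (1-\eta)(1+\alpha) + \lambda\alpha^2/\rho & \text{if } \alpha \in [-1,1], \\
(1-\eta)(1+\alpha) + \lambda\alpha^2/\rho & \text{if } \alpha \in [1,+\infty).
\end{cases}$$

This shows that $C_{\eta,\rho}$ is strictly decreasing on $(-\infty,-1]$ and strictly increasing on $[1,+\infty)$; as a result
it reaches its minimum on $[-1,1]$. Its derivative on this interval is equal to:

$$\forall \alpha \in (-1,1), \quad C'_{\eta,\rho}(\alpha) = \frac{2\lambda\alpha}{\rho} + 1 - 2\eta.$$

This shows that $C_{\eta,\rho}$ reaches its minimum at the point:

$$\alpha(\eta,\rho) = \begin{cases}
-1 & \text{if } \eta \le 1/2 - \lambda/\rho, \\
(\eta - 1/2)\rho/\lambda & \text{if } \eta \in [1/2 - \lambda/\rho, 1/2 + \lambda/\rho], \\
1 & \text{if } \eta \ge 1/2 + \lambda/\rho,
\end{cases} \qquad (64)$$

and that the value of this minimum is equal to:

$$M(\eta,\rho) = \begin{cases}
2\eta + \lambda/\rho & \text{if } \eta \le 1/2 - \lambda/\rho, \\
1 - \rho(\eta - 1/2)^2/\lambda & \text{if } \eta \in [1/2 - \lambda/\rho, 1/2 + \lambda/\rho], \\
2(1-\eta) + \lambda/\rho & \text{if } \eta \ge 1/2 + \lambda/\rho.
\end{cases}$$

From this we deduce that for all $(\rho,\theta) \in (0,+\infty) \times [0,1]$:

$$\psi_\rho(\theta) = \begin{cases}
\rho\theta^2/(4\lambda) & \text{if } 0 \le \theta \le 2\lambda/\rho, \\
\theta - \lambda/\rho & \text{if } 2\lambda/\rho \le \theta \le 1,
\end{cases}$$

whose inverse function is

$$\psi_\rho^{-1}(u) = \begin{cases}
\sqrt{4\lambda u/\rho} & \text{if } 0 \le u \le \lambda/\rho, \\
u + \lambda/\rho & \text{if } u \ge \lambda/\rho.
\end{cases} \qquad (65)$$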
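The closed forms for the 1-SVM can be sanity-checked against a brute-force minimization. The sketch below is illustrative only: it codes the hinge loss, the minimizer (64), the minimum M and the functions ψ_ρ and ψ_ρ^{-1} from (65), and compares M with a grid minimum of C_{η,ρ}. The function names and the grid resolution are assumptions made for this example.

```python
import math

def hinge(a):
    """1-SVM loss phi(alpha) = max(1 - alpha, 0)."""
    return max(1.0 - a, 0.0)

def c_reg(eta, rho, lam, a):
    """Regularized generic conditional risk C_{eta,rho}(alpha) for the hinge loss."""
    return eta * hinge(a) + (1.0 - eta) * hinge(-a) + lam * a * a / rho

def alpha_star(eta, rho, lam):
    """Closed-form minimizer (64): (eta - 1/2) * rho / lam clipped to [-1, 1]."""
    return max(-1.0, min(1.0, (eta - 0.5) * rho / lam))

def m_closed(eta, rho, lam):
    """Closed-form minimum M(eta, rho), following the three cases in the text."""
    if eta < 0.5 - lam / rho:
        return 2.0 * eta + lam / rho
    if eta > 0.5 + lam / rho:
        return 2.0 * (1.0 - eta) + lam / rho
    return 1.0 - rho * (eta - 0.5) ** 2 / lam

def psi(theta, rho, lam):
    """psi_rho(theta) = phi(0) - M((1 + theta) / 2, rho), with phi(0) = 1."""
    return 1.0 - m_closed((1.0 + theta) / 2.0, rho, lam)

def psi_inv(u, rho, lam):
    """Inverse (65) of psi_rho."""
    if u <= lam / rho:
        return math.sqrt(4.0 * lam * u / rho)
    return u + lam / rho

def m_grid(eta, rho, lam, n=40001, lo=-2.0, hi=2.0):
    """Brute-force minimum of C_{eta,rho} over a grid, used only as a sanity check."""
    step = (hi - lo) / (n - 1)
    return min(c_reg(eta, rho, lam, lo + i * step) for i in range(n))
```

For instance, with λ = 1 and ρ = 4 the threshold 2λ/ρ equals 0.5, so `psi` is quadratic for θ below 0.5 and affine above it, and `psi_inv` inverts both branches.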


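7.2 The Case of 2-SVM

Let $\phi(\alpha) = \max(1-\alpha, 0)^2$. Then we obtain, for any $(\eta,\rho) \in [0,1] \times (0,+\infty)$:

$$C_{\eta,\rho}(\alpha) = \begin{cases}
\eta(1-\alpha)^2 + \lambda\alpha^2/\rho & \text{if } \alpha \in (-\infty,-1], \\
\eta(1-\alpha)^2 + (1-\eta)(1+\alpha)^2 + \lambda\alpha^2/\rho & \text{if } \alpha \in [-1,1], \\
(1-\eta)(1+\alpha)^2 + \lambda\alpha^2/\rho & \text{if } \alpha \in [1,+\infty).
\end{cases}$$

This shows that $C_{\eta,\rho}$ is strictly decreasing on $(-\infty,-1]$ and strictly increasing on $[1,+\infty)$; as a result
it reaches its minimum on $[-1,1]$. Its derivative on this interval is equal to:

$$\forall \alpha \in (-1,1), \quad C'_{\eta,\rho}(\alpha) = 2\left[ \left( 1 + \frac{\lambda}{\rho} \right)\alpha + 1 - 2\eta \right].$$

This shows that $C_{\eta,\rho}$ reaches its minimum at the point:

$$\alpha(\eta,\rho) = (2\eta - 1)\frac{\rho}{\lambda + \rho}, \qquad (66)$$

and that the value of this minimum is equal to:

$$M(\eta,\rho) = 1 - (2\eta - 1)^2 \frac{\rho}{\lambda + \rho}.$$

From this we deduce that for all $(\rho,\theta) \in (0,+\infty) \times [0,1]$:

$$\psi_\rho(\theta) = \frac{\rho\, \theta^2}{\lambda + \rho},$$

whose inverse function is

$$\psi_\rho^{-1}(u) = \sqrt{\left( 1 + \frac{\lambda}{\rho} \right) u}. \qquad (67)$$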
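As for the 1-SVM, the 2-SVM closed forms can be verified numerically. The sketch below, which is an illustration rather than part of the original derivation, checks that the point (66) is a stationary point of C_{η,ρ} by a central finite difference, that it attains the stated minimum, and that (67) inverts ψ_ρ. Function names and the step size h are assumptions.

```python
import math

def sq_hinge(a):
    """2-SVM loss phi(alpha) = max(1 - alpha, 0)^2."""
    return max(1.0 - a, 0.0) ** 2

def c_reg(eta, rho, lam, a):
    """Regularized generic conditional risk C_{eta,rho}(alpha) for the squared hinge loss."""
    return eta * sq_hinge(a) + (1.0 - eta) * sq_hinge(-a) + lam * a * a / rho

def alpha_star(eta, rho, lam):
    """Minimizer (66)."""
    return (2.0 * eta - 1.0) * rho / (lam + rho)

def m_closed(eta, rho, lam):
    """Minimum value M(eta, rho)."""
    return 1.0 - (2.0 * eta - 1.0) ** 2 * rho / (lam + rho)

def psi(theta, rho, lam):
    """psi_rho(theta) = rho * theta^2 / (lam + rho)."""
    return rho * theta ** 2 / (lam + rho)

def psi_inv(u, rho, lam):
    """Inverse (67) of psi_rho."""
    return math.sqrt((1.0 + lam / rho) * u)

def derivative(eta, rho, lam, a, h=1e-6):
    """Central finite-difference approximation of the derivative of C_{eta,rho} at a."""
    return (c_reg(eta, rho, lam, a + h) - c_reg(eta, rho, lam, a - h)) / (2.0 * h)
```

Since the minimizer always lies strictly inside (-1, 1), the risk is an exact quadratic around it and the finite difference is accurate up to rounding.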


Remark 30 The minimum of $C_{\eta,\rho}$ being reached on $(-1,1)$ for any $(\eta,\rho) \in [0,1] \times (0,+\infty)$, the
result would be identical for any convex loss function $\phi_0$ that is equal to $(1-\alpha)^2$ on $(-\infty,1)$. Indeed,
the corresponding regularized generic conditional $\phi_0$-risk would coincide with $C_{\eta,\rho}$ on $(-1,1)$ and
would be no smaller than $C_{\eta,\rho}$ outside of this interval; it would therefore have the same minimal
value reached at the same point, and consequently the same functions M and $\psi$. This is for example
the case with the loss function used in LS-SVM, $\phi_0(\alpha) = (1-\alpha)^2$.

7.3 Proof of Theorem 6


Starting with $\phi_1(\alpha) = \max(1-\alpha, 0)$, let us follow the proof of Theorem 29 by taking $\beta = \nu = 1/2$
and $\kappa = 2\sqrt{\lambda}$. For r defined as in (59), let us choose


1 
: cot Bryty 
ry =min| | —p- . 
K2 B 


For a function f € M , choosing € as in (60), 6 < rı implies 
1 
WHY \ Be 
ô< (2 = ) 
K2 B 


pas Ee 
a (evar ae ayah) Brytv 





and therefore: 
1 


T 


62 B< 


on | > 


This ensures by (65) that for u = 8278, one indeed has 


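
$$\psi_\varepsilon^{-1}(u) = \kappa\, u^{\beta} \varepsilon^{-\nu},$$

which allows the rest of the proof, in particular (59), to be valid. This proves the result for $\phi_1$, with

$$K_1 = 2 \times 2^{\frac{2\gamma}{2\gamma+1}}\, \lambda^{\frac{\gamma}{2\gamma+1}}\, c_2^{\frac{1}{2\gamma+1}}.$$

For $\phi_2(\alpha) = \max(1-\alpha, 0)^2$ we can observe from (67) that, for any $\varepsilon \in (0,\varepsilon_0]$,

$$\psi_\varepsilon^{-1}(u) \le \sqrt{(\lambda + \varepsilon_0)\frac{u}{\varepsilon}},$$

and the proof of Theorem 29 leads to the claimed result with $r_2 = r$ defined in (59), and

$$K_2 = 2\, (\lambda + \varepsilon_0)^{\frac{\gamma}{2\gamma+1}}\, c_2^{\frac{1}{2\gamma+1}}.$$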

Remark 31 We note here that $\varepsilon_0$ can be chosen as small as possible in order to move the constant $K_2$
as close as possible to its lower bound:

$$K_2 = 2\, \lambda^{\frac{\gamma}{2\gamma+1}}\, c_2^{\frac{1}{2\gamma+1}},$$

but the counterpart of decreasing $K_2$ is to decrease $r_2$ too, by (59). We also notice that the constant
corresponding to the 1-SVM loss function is larger than that of the 2-SVM loss function, by a factor
of up to $2^{\frac{2\gamma}{2\gamma+1}}$.

8. Consistency of One-class SVMs for Density Level Set Estimation


In this section we focus on the one-class case: $\eta$ is identically equal to 1, and P is just considered
as a distribution on $\mathbb{R}^d$, with density $\rho$ with respect to the Lebesgue measure. The first subsection is
devoted to the proof of Theorem 8, and the second subsection to the proof of Theorem 9.


8.1 Proof of Theorem 8


Theorem 8 easily follows from combining some results given in this paper. First, it follows from
(64) that, in the one-class case where $\eta = 1$ on its domain, the asymptotic function $f_{\phi,0}$ equals the
truncated density $\rho_\lambda$. Then, using Lemma 7, we get the following bound:

$$\| \hat{f}_\sigma - \rho_\lambda \|^2_{L_2} \le \frac{1}{\lambda} \left( R_{\phi,0}(\hat{f}_\sigma) - R^*_{\phi,0} \right).$$

Finally, under the assumption limg_.9 © (Pa, ©) = 0, using Theorem 3 with, for instance, p = 1, 6 = 
lo= (1/n) 1/4, 0 = (1/n) 1/4, and x = log (n), we deduce that for any € > 0, 


$$P\left( \| \hat{f}_\sigma - \rho_\lambda \|_{L_2} \ge \varepsilon \right) \to 0 \quad \text{as } n \to \infty.$$


8.2 Proof of Theorem 9 


Theorem 9 directly follows from combining Theorem 8 with Theorem 33, which is stated and
proved at the end of this section. To prove Theorem 33, it will be useful to first state Lemma 32.

Before going to this point, let us recall some specific notation in the context of density level set
estimation. The aim is to estimate a density level set of level $\mu$, for some $\mu > 0$:

$$C_\mu := \left\{ x \in \mathbb{R}^d : \rho(x) \ge \mu \right\}. \qquad (68)$$

The estimator that is considered here is the plug-in density level set estimator associated with $\hat{f}_\sigma$,
denoted by $\hat{C}_\mu$:

$$\hat{C}_\mu := \left\{ x \in \mathbb{R}^d : 2\lambda \hat{f}_\sigma(x) \ge \mu \right\}. \qquad (69)$$

Recall that the asymptotic behaviour of $\hat{f}_\sigma$ in the one-class case is given in Theorem 8: $\hat{f}_\sigma$ converges
to $\rho_\lambda$, which is proportional to the density $\rho$ truncated at level $2\lambda$. Taking into account the behaviour
of $\rho_\lambda$, we only consider the situation where $0 < \mu < 2\lambda < \sup_{\mathbb{R}^d}(\rho) = M$. The density $\rho$ is still
assumed to have a compact support $S \subset \mathcal{X}$. To assess the quality of $\hat{C}_\mu$ we use the so-called excess
mass functional, first introduced by Hartigan (1987), which is defined for any measurable subset C
of $\mathbb{R}^d$ as follows:

$$H_P(C) := P(C) - \mu\, \mathrm{Leb}(C), \qquad (70)$$

where Leb is the Lebesgue measure. Note that $H_P$ is defined with respect to both P and $\mu$, and that
it is maximized by $C_\mu$. Hence, the quality of an estimate $\hat{C}$ depends here on how close its excess mass is
to that of $C_\mu$.
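The objects (68) to (70) can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden example, not taken from the paper: it uses the hypothetical density ρ(x) = 2x on [0, 1], the hypothetical values λ = 0.75 and μ = 1 (so that 0 < μ < 2λ < sup ρ), and the limit function ρ_λ = min(ρ/(2λ), 1) suggested by (64) with η = 1, standing in for the estimator in (69).

```python
LAM = 0.75   # hypothetical regularization parameter; truncation level is 2 * LAM = 1.5
MU = 1.0     # hypothetical density level, with 0 < MU < 2 * LAM < sup rho = 2

def rho(x):
    """Hypothetical density rho(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def rho_lam(x):
    """Limit function in the one-class case: rho / (2 * LAM) truncated at 1."""
    return min(rho(x) / (2.0 * LAM), 1.0)

def in_plugin_set(x):
    """Membership in the plug-in set (69), with rho_lam standing in for the estimator."""
    return 2.0 * LAM * rho_lam(x) >= MU

def excess_mass(a, b):
    """H_P([a, b]) = P([a, b]) - MU * Leb([a, b]) for 0 <= a <= b <= 1.

    Uses P([a, b]) = b^2 - a^2, which holds for the density rho(x) = 2x.
    """
    return (b * b - a * a) - MU * (b - a)
```

Because μ < 2λ, truncating the density at level 2λ does not alter the level set at level μ: the plug-in set built from ρ_λ coincides with C_μ = [0.5, 1], and among intervals of the form [a, 1] this one has the largest excess mass.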

The following lemma relates the $L_2$ convergence of a density estimator to the consistency of the
associated plug-in density level set estimator, with respect to the excess mass criterion:

Lemma 32 Let P be a probability distribution on $\mathbb{R}^d$ with compact support $S \subset \mathcal{X}$. Assume that P
is absolutely continuous with respect to the Lebesgue measure, and let $\rho$ denote its associated
density function. Assume furthermore that $\rho$ is bounded on S. Consider a non-negative density
estimate $\hat{\rho}$ defined on $\mathbb{R}^d$. Then the following holds

$$H_P(C_\mu) - H_P(\hat{C}) \le K_5\, \| \hat{\rho} - \rho \|_{L_2}, \qquad (71)$$

where $\hat{C}$ is the level set of $\hat{\rho}$ at level $\mu$, and

$$K_5 = \frac{2 \sqrt{m \| \rho \|_{L_\infty} + \frac{1-m}{\mathrm{Leb}(S)}}}{\mu\, m}.$$
Proof To prove the lemma, it is convenient to first build an artificial classification problem using the
density function $\rho$ and the desired level $\mu$, then to relate the excess risk involved in this classification
problem to the excess mass involved in the original one-class problem. Note that this technique has
already been used in Steinwart et al. (2005). Let us consider the following joint distribution $Q$
defined by its marginal density function

$$q(x) := \begin{cases} m\rho(x) + (1-m)\dfrac{1}{\mathrm{Leb}(S)} & \text{if } x \in S , \\ 0 & \text{otherwise} , \end{cases} \tag{72}$$
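and by its regression function

$$\eta_Q(x) := \frac{m\rho(x)}{m\rho(x) + (1-m)\dfrac{1}{\mathrm{Leb}(S)}} , \quad x \in S , \tag{73}$$

where $m$ is chosen such that

$$\eta_Q(x) = \frac{\rho(x)}{\rho(x) + \mu} , \tag{74}$$

that is

$$m := \frac{1}{1 + \mu\,\mathrm{Leb}(S)} . \tag{75}$$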
In words, in the above artificial classification problem, the initial distribution $P$ stands for the
marginal distribution of the positive class, and the negative class is generated by the uniform distribution
over the support of $P$. The mixture coefficient $m$ is determined by the initially desired density
level $\mu$. The corresponding Bayes classifier, which is the plug-in rule associated with $\eta_Q$, is denoted
by $h^*$.
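
The role of the mixture coefficient can be checked numerically. The following is a minimal sketch under assumed toy values (a Beta(2,2) density on $S = [0,1]$, so that $\mathrm{Leb}(S) = 1$, and $\mu = 1$); the helper names are hypothetical. It evaluates (73) with $m$ from (75) and compares it with the target form (74).

```python
import numpy as np

def mixture_coefficient(mu, leb_s):
    """m as in Equation (75)."""
    return 1.0 / (1.0 + mu * leb_s)

def eta_q(rho, m, leb_s):
    """Regression function of Equation (73), evaluated on S."""
    positive = m * rho
    return positive / (positive + (1.0 - m) / leb_s)

# Assumed toy setting: rho(x) = 6 x (1 - x) on S = [0, 1].
x = np.linspace(0.0, 1.0, 1001)
rho = 6.0 * x * (1.0 - x)
mu, leb_s = 1.0, 1.0

m = mixture_coefficient(mu, leb_s)
eta = eta_q(rho, m, leb_s)
target = rho / (rho + mu)          # right-hand side of Equation (74)
bayes_positive = eta > 0.5         # region where the Bayes rule predicts +1
```

In this setting `eta` agrees with `target` up to rounding, and the region where the Bayes rule predicts the positive class coincides with the grid points where `rho > mu`.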
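Furthermore let us define $\hat\eta_Q := \hat\rho / (\hat\rho + \mu)$, which stands for an estimate of $\eta_Q$ in our artificial
classification problem, and $\hat h$ as the plug-in classifier associated with $\hat\eta_Q$: $\hat h := \mathrm{sign}(2\hat\eta_Q - 1)$. Then it
is straightforward that $h^*$ is the indicator function of $C_\mu$, and that $\hat h$ is the indicator function of $\hat C$.
Moreover,

$$R(\hat h) - R(h^*) = m \left( H_P(C_\mu) - H_P(\hat C) \right) .$$

Indeed,

$$\begin{aligned}
R(\hat h) &= Q(\hat h(X) \neq Y) \\
&= Q(Y = -1)\, Q(\hat h(X) = 1 \mid Y = -1) + Q(Y = 1)\, Q(\hat h(X) = -1 \mid Y = 1) \\
&= (1-m) \frac{\mathrm{Leb}(\hat C)}{\mathrm{Leb}(S)} + m \left( 1 - P(\hat C) \right) ,
\end{aligned}$$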
and, similarly,

$$R(h^*) = (1-m) \frac{\mathrm{Leb}(C_\mu)}{\mathrm{Leb}(S)} + m \left( 1 - P(C_\mu) \right) , \tag{76}$$

which proves the claim.
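
The subtraction behind the claim relies on one substitution that is left implicit: Equation (75) gives $(1-m)/\mathrm{Leb}(S) = m\mu$. A worked version, using only (70), (75) and the two risk expressions above, can be sketched as follows.

```latex
\begin{align*}
R(\hat h) - R(h^*)
 &= \frac{1-m}{\mathrm{Leb}(S)}\bigl(\mathrm{Leb}(\hat C) - \mathrm{Leb}(C_\mu)\bigr)
    - m\bigl(P(\hat C) - P(C_\mu)\bigr) \\
 &= m\mu\bigl(\mathrm{Leb}(\hat C) - \mathrm{Leb}(C_\mu)\bigr)
    - m\bigl(P(\hat C) - P(C_\mu)\bigr) \\
 &= m\Bigl[\bigl(P(C_\mu) - \mu\,\mathrm{Leb}(C_\mu)\bigr)
    - \bigl(P(\hat C) - \mu\,\mathrm{Leb}(\hat C)\bigr)\Bigr] \\
 &= m\bigl(H_P(C_\mu) - H_P(\hat C)\bigr).
\end{align*}
```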
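Now, the following can be derived, starting from an equality that can be found in Devroye et al.
(1996, page 16):

$$\begin{aligned}
R(\hat h) - R(h^*) &= 2\, \mathbb{E}_Q \left[ \left| \eta_Q - \tfrac{1}{2} \right| \mathbf{1}_{\{\hat h \neq h^*\}} \right] \\
&\le 2\, \mathbb{E}_Q \left[ \left| \eta_Q - \hat\eta_Q \right| \right] \\
&\le 2 \left( \mathbb{E}_Q \left[ \left( \eta_Q - \hat\eta_Q \right)^2 \right] \right)^{1/2} \\
&= 2 \left( \int_S \left( \frac{\mu \left( \rho(x) - \hat\rho(x) \right)}{\left( \rho(x) + \mu \right) \left( \hat\rho(x) + \mu \right)} \right)^2 q(x)\, dx \right)^{1/2} \\
&\le 2 \sqrt{A} \left( \int_S \left( \frac{\hat\rho(x) - \rho(x)}{\mu} \right)^2 dx \right)^{1/2} \\
&\le \frac{2\sqrt{A}}{\mu} \|\hat\rho - \rho\|_{L_2} ,
\end{aligned}$$

where $A$ is a positive uniform upper bound on $q(x)$, for instance $A = m\|\rho\|_{L_\infty} + (1-m)/\mathrm{Leb}(S)$.
Combining the previous equality with the last inequality concludes the proof. □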
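
The bound of Lemma 32 can be sanity-checked on a grid. This is only an illustrative sketch: the Beta(2,2) density, the cosine perturbations used as density estimates, and the constant `K5 = 2 * sqrt(A) / (m * mu)` (obtained by combining the risk identity with the last inequality of the proof) are assumptions of the example, not values taken from the paper.

```python
import numpy as np

# Assumed toy setting: rho(x) = 6 x (1 - x) on S = [0, 1], level mu = 1.
x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
rho = 6.0 * x * (1.0 - x)
mu, leb_s = 1.0, 1.0

m = 1.0 / (1.0 + mu * leb_s)             # Equation (75)
A = m * rho.max() + (1.0 - m) / leb_s    # uniform upper bound on q
K5 = 2.0 * np.sqrt(A) / (m * mu)         # assumed form of the constant

def excess_mass(mask):
    """Riemann-sum approximation of H_P(C) for C given as a grid mask."""
    return float(np.sum(rho[mask] - mu) * dx)

def gap_and_bound(rho_hat):
    """Left- and right-hand sides of (71) for a density estimate rho_hat."""
    gap = excess_mass(rho > mu) - excess_mass(rho_hat > mu)
    l2 = float(np.sqrt(np.sum((rho_hat - rho) ** 2) * dx))
    return gap, K5 * l2

# Non-negative perturbed estimates of increasing amplitude.
results = [
    gap_and_bound(np.clip(rho + eps * np.cos(7.0 * np.pi * x), 0.0, None))
    for eps in (0.0, 0.05, 0.2, 0.5)
]
```

Each pair in `results` holds the excess mass deficit and the corresponding value of $K_5\|\hat\rho - \rho\|_{L_2}$; in this setting the deficit stays below the bound for every perturbation.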


We could just directly apply this lemma to $\hat f_\sigma$, $\rho_\lambda$ and the distribution $P_\lambda$ defined$^5$ through $\rho_\lambda$, but
this would prove the consistency of $\hat f_\sigma$ with respect to the excess mass $H_{P_\lambda}$, which is different from
the criterion $H_P$ of interest. The following theorem implies that the plug-in density level set estimator
at level $0 < \mu < 2\lambda$ based on the one-class SVM estimator is indeed consistent with respect to the
excess mass $H_P$.


Theorem 33 Let $\hat f$ be a non-negative square-integrable function that estimates $\rho_\lambda$ (as defined in
Equation 7). Let $0 < \mu < 2\lambda$. Let $\hat C$ denote the level set of $2\lambda \hat f$ at level $\mu$. Then


$$H_P(C_\mu) - H_P(\hat C) \le K_6 \|\hat f - \rho_\lambda\|_{L_2} , \tag{77}$$

where $K_6 > 0$ depends neither on $\sigma$, nor on $n$.
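Proof Let us introduce the following estimator:

$$\hat\rho := 2\lambda \hat f + \tilde\rho_\lambda , \tag{78}$$

where the function $\tilde\rho_\lambda$ is defined as follows:

$$\tilde\rho_\lambda(x) := \begin{cases} \rho(x) - 2\lambda & \text{if } \rho(x) > 2\lambda , \\ 0 & \text{otherwise} , \end{cases} \tag{79}$$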





5. Note that $P_\lambda$ is not a probability distribution, since the function $\rho_\lambda$ does not integrate to 1. Still, the excess mass
remains well-defined.
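and let $\tilde C$ denote the level set of $\hat\rho$ at level $\mu$. It can be checked that $\hat\rho - \rho = 2\lambda (\hat f - \rho_\lambda)$, implying
that

$$\|\hat\rho - \rho\|_{L_2} = 2\lambda \|\hat f - \rho_\lambda\|_{L_2} . \tag{80}$$

Hence, using Lemma 32, we have

$$H_P(C_\mu) - H_P(\tilde C) \le K_5 \|\hat\rho - \rho\|_{L_2} = 2\lambda K_5 \|\hat f - \rho_\lambda\|_{L_2} , \tag{81}$$

leading to

$$H_P(C_\mu) - H_P(\hat C) \le 2\lambda K_5 \|\hat f - \rho_\lambda\|_{L_2} + \left| H_P(\tilde C) - H_P(\hat C) \right| . \tag{82}$$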


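The last thing to do is to bound $|H_P(\hat C) - H_P(\tilde C)|$. Since $P$ has a bounded density w.r.t. the
Lebesgue measure,

$$\left| H_P(\hat C) - H_P(\tilde C) \right| \le (\mu + M)\, \mathrm{Leb}(\hat C \,\Delta\, \tilde C) . \tag{83}$$

By construction of $\hat\rho$, if $C_{2\lambda}$ denotes the level set of $\rho$ at level $2\lambda$, and $\bar C_{2\lambda}$ its complement in $\mathbb{R}^d$,
then we have $\hat C \cap \bar C_{2\lambda} = \tilde C \cap \bar C_{2\lambda}$ and $2\lambda \hat f > \mu \Rightarrow \hat\rho > \mu$. Hence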


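$$\begin{aligned}
\mathrm{Leb}(\hat C \,\Delta\, \tilde C) &= \int_{C_{2\lambda}} \mathbf{1}_{\{\hat\rho > \mu\}}\, \mathbf{1}_{\{2\lambda \hat f \le \mu\}}\, dx \\
&\le \int_{C_{2\lambda}} \frac{2\lambda - 2\lambda \hat f}{2\lambda - \mu}\, \mathbf{1}_{\{2\lambda \hat f \le \mu\}}\, dx \\
&\le \frac{2\lambda}{2\lambda - \mu} \int_{C_{2\lambda}} \left| \rho_\lambda - \hat f \right| dx \\
&\le \frac{2\lambda \sqrt{\mathrm{Leb}(S)}}{2\lambda - \mu} \|\hat f - \rho_\lambda\|_{L_2} ,
\end{aligned}$$

where the second inequality uses $\rho_\lambda = 1$ on $C_{2\lambda}$, and the third follows from the Cauchy-Schwarz
inequality together with $C_{2\lambda} \subset S$.

This concludes the proof. □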
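
The construction used in this proof can be illustrated numerically. The sketch below is not taken from the paper: it assumes $\rho_\lambda = \min(\rho, 2\lambda)/(2\lambda)$ (the truncated form consistent with the identity $\hat\rho - \rho = 2\lambda(\hat f - \rho_\lambda)$), a Beta(2,2) density, a noisy hypothetical estimate `f_hat`, and a Cauchy-Schwarz bound on the symmetric difference with $\mathrm{Leb}(S) = 1$.

```python
import numpy as np

# Assumed toy setting: rho(x) = 6 x (1 - x) on S = [0, 1], so M = 1.5.
x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
rho = 6.0 * x * (1.0 - x)
lam, mu = 0.6, 1.0                                    # 0 < mu < 2*lam < M

rho_lam = np.minimum(rho, 2.0 * lam) / (2.0 * lam)    # assumed form of rho_lambda
rho_tilde = np.maximum(rho - 2.0 * lam, 0.0)          # Equation (79)

# Hypothetical non-negative estimate of rho_lambda (noise is illustrative).
rng = np.random.default_rng(0)
f_hat = np.clip(rho_lam + 0.3 * rng.standard_normal(x.size), 0.0, None)
rho_hat = 2.0 * lam * f_hat + rho_tilde               # Equation (78)

C_hat = 2.0 * lam * f_hat > mu                        # level set of 2*lam*f_hat
C_tilde = rho_hat > mu                                # level set of rho_hat

identity_ok = np.allclose(rho_hat - rho, 2.0 * lam * (f_hat - rho_lam))
nested_ok = bool(np.all(C_tilde[C_hat]))              # C_hat is contained in C_tilde
sym_diff = float(np.sum(C_hat ^ C_tilde) * dx)        # Leb of the symmetric difference
l2 = float(np.sqrt(np.sum((f_hat - rho_lam) ** 2) * dx))
leb_bound = 2.0 * lam * np.sqrt(1.0) / (2.0 * lam - mu) * l2   # Leb(S) = 1
```

Here `identity_ok` checks the pointwise identity behind (80), `nested_ok` checks the implication $2\lambda \hat f > \mu \Rightarrow \hat\rho > \mu$, and `sym_diff` can be compared with `leb_bound` to see the last step of the proof at work.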